Praise for Essential Math for AI

Technology and AI markets are like a river, where some parts are moving faster than others. Successfully applying AI requires the skill of assessing the direction of the flow and complementing it with a strong foundation, which this book enables, in an engaging, delightful, and inclusive way. Hala has made math fun for a spectrum of participants in the AI-enabled future!

Adri Purkayastha, Group Head, AI Operational Risk and Digital Risk Analytics, BNP Paribas

Texts on artificial intelligence are usually either technical manuscripts written by experts for other experts, or cursory, math-free introductions catered to general audiences. This book takes a refreshing third path by introducing the mathematical foundations for readers in business, data, and similar fields without advanced mathematics degrees. The author weaves elegant equations and pithy observations throughout, all the while asking the reader to consider the very serious implications artificial intelligence has on society. I recommend Essential Math for AI to anyone looking for a rigorous treatment of AI fundamentals viewed through a practical lens.

George Mount, Data Analyst and Educator

Hala has done a great job in explaining crucial mathematical concepts. This is a must-read for every serious machine learning practitioner. You’d love the field more once you go through the book.

Umang Sharma, Senior Data Scientist and Author

To understand artificial intelligence, one needs to understand the relationship between math and AI. Dr. Nelson made this easy by giving us the foundation on which the symbiotic relationship between the two disciplines is built.

Huan Nguyen, Rear Admiral (Ret.), Cyber Engineering, NAVSEA

Essential Math for AI

Next-Level Mathematics for Efficient and Successful AI Systems

Hala Nelson

Essential Math for AI 

by Hala Nelson

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Aaron Black
  • Development Editor: Angela Rufino
  • Production Editor: Kristen Brown
  • Copyeditor: Sonia Saruba
  • Proofreader: JM Olejarz
  • Indexer: nSight, Inc.
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Kate Dullea
  • January 2023: First Edition

Revision History for the First Edition

  • 2023-01-04: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098107635 for release details.

Preface

Why I Wrote This Book

AI is built on mathematical models. We need to know how.

I wrote this book in purely colloquial language, leaving most of the technical details out. It is a math book about AI with very few mathematical formulas and equations, no theorems, no proofs, and no coding. My goal is to not keep this important knowledge in the hands of the very few elite, and to attract more people to technical fields. I believe that many people get turned off by math before they ever get a chance to know that they might love it and be naturally good at it. This also happens in college or in graduate school, where many students switch their majors from math, or start a Ph.D. and never finish it. The reason is not that they do not have the ability, but that they saw no motivation or an end goal for learning torturous methods and techniques that did not seem to transfer to anything useful in their lives. It is like going to a strenuous mental gym every day only for the sake of going there. No one even wants to go to a real gym every day (this is a biased statement, but you get the point). In math, formalizing objects into functions, spaces, measure spaces, and entire mathematical fields comes after motivation, not before. Unfortunately, it gets taught in reverse, with formality first and then, if we are lucky, some motivation.

The most beautiful thing about math is that it has the expressive ability to connect seemingly disparate things together. A field as big and as consequential as AI not only builds on math, as that is a given; it also needs the binding ability that only math can provide in order to tell its big story concisely. In this book I will extract the math required for AI in a way that does not deviate at all from the real-life AI application in mind. It is infeasible to go through existing tools in detail and not fall into an encyclopedic and overwhelming treatment. What I do instead is try to teach you how to think about these tools and view them from above, as a means to an end that we can tweak and adjust when we need to. I hope that you will get out of this book a way of seeing how things relate to each other and why we develop or use certain methods among others. In a way, this book provides a platform that launches you to whatever area you find interesting or want to specialize in.

Another goal of this book is to democratize mathematics, and to build more confidence to ask about how things work. Common answers such as “It’s complicated mathematics,” “It’s complicated technology,” or “It’s complex models,” are no longer satisfying, especially since the technologies that build on mathematical models currently affect every aspect of our lives. We do not need to be experts in every field in mathematics (no one is) in order to understand how things are built and why they operate the way they do. There is one thing about mathematical models that everyone needs to know: they always give an answer. They always spit out a number. A model that is vetted, validated, and backed with sound theory gives an answer. Also, a model that is complete trash gives an answer. Both compute mathematical functions. Saying that our decisions are based on mathematical models and algorithms does not make them sacred. What are the models built on? What are their assumptions? Limitations? Data they were trained on? Tested on? What variables did they take into account? And what did they leave out? Do they have a feedback loop for improvement, ground truths to compare to and improve on? Is there any theory backing them up? We need to be transparent with this information when the models are ours, and ask for it when the models are deciding our livelihoods for us.

The unorthodox organization of the topics in this book is intentional. I wanted to avoid getting stuck in math details before getting to the applicable stuff. My stand on this is that we do not ever need to dive into background material unless we happen to be personally practicing something, and that background material becomes an unfulfilled gap in our knowledge that is stopping us from making progress. Only then is it worth investing serious time to learn the intricate details of things. It is much more important to see how it all ties together and where everything fits. In other words, this book provides a map for how everything between math and AI interacts nicely together.

I also want to make a note to newcomers about the era of large data sets. Before working with large data, real or simulated, structured or unstructured, we might have taken computers and the internet for granted. If we came up with a model or needed to run analytics on small and curated data sets, we might have assumed that our machine’s hardware would handle the computations, or that the internet would just give more curated data when we needed it, or more information about similar models. The realities and limitations of accessing data, errors in the data, errors in the outputs of queries, hardware limitations, storage, data flow between devices, and vectorizing unstructured data such as natural language or images and movies hit us really hard. That is when we start getting into parallel computing, cloud computing, data management, databases, data structures, data architectures, and data engineering in order to understand the compute infrastructure that allows us to run our models. What kind of infrastructure do we have? How is it structured? How did it evolve? Where is it headed? What is the architecture like, including the involved solid materials? How do these materials work? And what is all the fuss about quantum computing? We should not view the software as separate from the hardware, or our models as separate from the infrastructure that allows us to simulate them. This book focuses only on the math, the AI models, and some data. There are neither exercises nor coding. In other words, we focus on the soft, the intellectual, and the “I do not need to touch anything” side of things. But we need to keep learning until we are able to comprehend the technology that powers many aspects of our lives as the one interconnected body that it actually is: hardware, software, sensors and measuring devices, data warehouses, connecting cables, wireless hubs, satellites, communication centers, physical and software security measures, and mathematical models.

Who Is This Book For?

I wrote this book for:

  • A person who knows math but wants to get into AI, machine learning, and data science.

  • A person who practices AI, data science, and machine learning but wants to brush up on their mathematical thinking and get up-to-date with the mathematical ideas behind the state-of-the-art models.

  • Undergraduate or early graduate students in math, data science, computer science, operations research, science, engineering, or other domains who have an interest in AI.

  • People in management positions who want to integrate AI and data analytics into their operations but want a deeper understanding of how the models that they might end up basing their decisions on actually work.

  • Data analysts who are primarily doing business intelligence, and are now, like the rest of the world, driven into AI-powered business intelligence. They want to know what that actually means before adopting it into business decisions.

  • People who care about the ethical challenges that AI might pose to the world and want to understand the inner workings of the models so that they can argue for or against certain issues such as autonomous weapons, targeted advertising, data management, etc.

  • Educators who want to put together courses on math and AI.

  • Any person who is curious about AI.

Who Is This Book Not For?

This book is not for a person who likes to sit down and do many exercises to master a particular mathematical technique or method, a person who likes to write and prove theorems, or a person who wants to learn coding and development. This is not a math textbook. There are many excellent textbooks that teach calculus, linear algebra, and probability (but few books relate this math to AI). That said, this book has many in-text pointers to the relevant books and scientific publications for readers who want to dive into technicalities, rigorous statements, and proofs. This is also not a coding book. The emphasis is on concepts, intuition, and general understanding, rather than on implementing and developing the technology.

How Will the Math Be Presented in This Book?

Writing a book is ultimately a decision-making process: how to organize the material of the subject matter in a way that is most insightful into the field, and how to choose what and what not to elaborate on. I will detail some math in a few places, and I will omit details in others. This is on purpose, as my goal is to not get distracted from telling the story of:

Which math do we use, why do we need it, and where exactly do we use it in AI?

I always define the AI context, with many applications. Then I talk about the related mathematics, sometimes with details and other times only with the general way of thinking. Whenever I skip details, I point out the relevant questions that we should be asking and how to go about finding answers. I showcase the math, the AI, and the models as one connected entity. I dive deeper into math only if it must be part of the foundation. Even then, I favor intuition over formality. The price I pay here is that on very few occasions, I might use some technical terms before defining them, secretly hoping that you might have encountered these terms before. In this sense, I adopt AI’s transformer philosophy (see Google Brain’s 2017 article: “Attention Is All You Need”) for natural language understanding: the model learns word meanings from their context. So when you encounter a technical term that I have not defined before, focus on the term’s surrounding environment. Over the course of the section within which it appears, you will have a very good intuition about its meaning. The other option, of course, is to google it. Overall, I avoid jargon and I use zero acronyms.

Since this book lies at the intersection of math, data science, AI, machine learning, and philosophy, I wrote it expecting a diverse audience with drastically different skill sets and backgrounds. For this reason, depending on the topic, the same material might feel trivial to some but complicated to others. I hope I do not insult anyone’s mind in the process. That said, this is a risk that I am willing to take, so that all readers will find useful things to learn from this book. For example, mathematicians will learn the AI application, and data scientists and AI practitioners will learn deeper math.

The sections go in and out of technical difficulty, so if a section gets too confusing, make a mental note of its existence and skip to the next section. You can come back to what you skipped later.

Most of the chapters are independent, so readers can jump straight to their topics of interest. When chapters are related to other chapters, I point that out. Since I try to make each chapter as self-contained as possible, I may repeat a few explanations across different chapters. I push the probability chapter all the way to near the end (Chapter 11), but I use and talk about probability distributions all the time (especially the joint probability distribution of the features of a data set). The idea is to get used to the language of probability and how it relates to AI models before learning its grammar, so when we get to learning the grammar, we have a good idea of the context that it fits in.

I believe that there are two types of learners: those who learn the specifics and the details, then slowly start formulating a bigger picture and a map for how things fit together; and those who first need to understand the big picture and how things relate to each other, then dive into the details only when needed. Both are equally important, and the difference is only in someone’s type of brain and natural inclination. I tend to fit more into the second category, and this book is a reflection of that: how does it all look from above, and how do math and AI interact with each other? The result might feel like a whirlwind of topics, but you’ll come out on the other side with a great knowledge base for both math and AI, plus a healthy dose of confidence.

When my dad taught me to drive, he sat in the passenger’s seat and asked me to drive. Ten minutes in, the road became a cliffside road. He asked me to stop, got out of the car, then said: “Now drive, just try not to fall off the cliff, don’t be afraid, I am watching” (like that was going to help). I did not fall off the cliff, and in fact I love cliffside roads the most. Now tie this to training self-driving cars by reinforcement learning, with the distinction that the cost of falling off the cliff would’ve been minus infinity for me. I could not afford that; I am a real person in a real car, not a simulation.

This is how you’ll do math and AI in this book. There are no introductions, conclusions, definitions, theorems, exercises, or anything of the like. There is immersion.

You are already in it. You know it. Now drive.

Infographic

I accompany this book with an infographic, visually tying all the topics together. You can also find this on the book’s GitHub page.

What Math Background Is Expected from You to Be Able to Read This Book?

This book is self-contained in the sense that we motivate everything that we need to use. I do hope that you have been exposed to calculus and some linear algebra, including vector and matrix operations, such as addition, multiplication, and some matrix decompositions. I also hope that you know what a function is and how it maps an input to an output. Most of what we do mathematically in AI involves constructing a function, evaluating a function, optimizing a function, or composing a bunch of functions. You need to know about derivatives (these measure how fast things change) and the chain rule for derivatives. You do not necessarily need to know how to compute them for each function, as computers, Python, Desmos, and/or Wolfram|Alpha do a lot of the math for us nowadays, but you need to know their meaning. Some exposure to probabilistic and statistical thinking is helpful as well. If you do not know any of the above, that is totally fine. You might have to sit down and do some examples (from some other books) on your own to familiarize yourself with certain concepts. The trick here is to know when to look up the things that you do not know: only when you need them, meaning only when you encounter a term that you do not understand, and you have a good idea of the context within which it appeared. If you are truly starting from scratch, you are not too far behind. This book tries to avoid technicalities at all costs.
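
If the chain rule feels rusty, a quick numerical check is a nice way to reconnect the formula with its meaning. This little Python sketch (an illustration of my own, not an example from the text, which stays code-free) compares the chain-rule answer for the derivative of sin(x) squared against a finite-difference estimate:

```python
import math

def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Composite function f(g(x)) with f(u) = u**2 and g(x) = sin(x)
g = math.sin
f = lambda u: u ** 2
composite = lambda x: f(g(x))

x = 0.7
# Chain rule: (f o g)'(x) = f'(g(x)) * g'(x) = 2*sin(x)*cos(x)
chain_rule_value = 2 * math.sin(x) * math.cos(x)
numerical_value = derivative(composite, x)

print(abs(chain_rule_value - numerical_value) < 1e-6)  # True
```

Knowing that the two numbers agree, and why, is exactly the level of understanding assumed here; the symbolic computation itself can always be delegated to a machine.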

Overview of the Chapters

We have a total of 14 chapters.

If you are a person who cares for math and the AI technology as they relate to ethics, policy, societal impact, and the various implications, opportunities, and challenges, then read Chapters 1 and 14 first. If you do not care for those, then we make the case that you should. In this book, we treat math as the binding agent of seemingly disparate topics, rather than the usual presentation of math as an oasis of complicated formulas, theorems, and Greek letters.

Chapter 13 might feel separate from the book if you’ve never encountered differential equations (ODEs and PDEs), but you will appreciate it if you are into mathematical modeling, the physical and natural sciences, simulation, or mathematical analysis, and you would like to know how AI can benefit your field, and in turn how differential equations can benefit AI. Countless scientific feats build on differential equations, so we cannot leave them out when we are at the dawn of a computational technology that has the potential to address many of the field’s long-standing problems. This chapter is not essential for AI per se, but it is essential for our general understanding of mathematics as a whole, and for building theoretical foundations for AI and neural operators.

The rest of the chapters are essential for AI, machine learning, and data science. There is no optimal location for Chapter 6 on the singular value decomposition (the essential math for principal component analysis and latent semantic analysis, and a great method for dimension reduction). Let your natural curiosity dictate when you read this chapter: before or after whichever chapter you feel would be the most fitting. It all depends on your background and which industry or academic discipline you happen to come from.

Let’s briefly overview Chapters 1 through 14:

Chapter 1, “Why Learn the Mathematics of AI?”

Artificial intelligence is here. It has already penetrated many areas of our lives, is involved in making important decisions, and soon will be employed in every sector of our society and operations. The technology is advancing very fast, and investment in it is skyrocketing. What is artificial intelligence? What is it able to do? What are its limitations? Where is it headed? And most importantly, how does it work, and why should we really care about knowing how it works? In this introductory chapter we briefly survey important AI applications, the problems usually encountered by companies trying to integrate AI into their systems, incidents that happen when systems are not well implemented, and the math typically used in AI solutions.

Chapter 2, “Data, Data, Data”

This chapter highlights the fact that data is central to AI. It explains the differences between concepts that are usually a source of confusion: structured and unstructured data, linear and nonlinear models, real and simulated data, deterministic functions and random variables, discrete and continuous distributions, prior probabilities, posterior probabilities, and likelihood functions. It also provides a map for the probability and statistics needed for AI without diving into any details, and introduces the most popular probability distributions.
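
To preview that prior/likelihood/posterior vocabulary with a concrete toy calculation (the numbers here are made up by me for illustration, not an example from the chapter), here is Bayes’ rule turning a prior belief and a likelihood into a posterior probability:

```python
# Toy Bayes' rule example: medical testing with made-up numbers.
# prior P(D), likelihood P(+|D), and posterior P(D|+).
prior = 0.01            # P(disease) before seeing any test result
sensitivity = 0.95      # likelihood P(test positive | disease)
false_positive = 0.05   # P(test positive | no disease)

# Total probability of a positive test, over both scenarios
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence
posterior = sensitivity * prior / evidence

print(round(posterior, 3))  # about 0.161: still unlikely, despite the positive test
```

The surprise that a 95%-sensitive test yields only a 16% posterior is the kind of intuition the probability map in that chapter is meant to build.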

Chapter 3, “Fitting Functions to Data”

At the core of many popular machine learning models, including the highly successful neural networks that brought artificial intelligence back into the popular spotlight since 2012, lies a very simple mathematical problem: fit a given set of data points into an appropriate function, then make sure this function performs well on new data. This chapter highlights this widely useful fact with a real data set and other simple examples. We discuss regression, logistic regression, support vector machines, and other popular machine learning techniques, with one unifying theme: training function, loss function, and optimization.
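
As a taste of that unifying theme, here is a minimal Python sketch (mine, with made-up data, not an example from the chapter) that fits the training function y = wx + b by minimizing a mean squared error loss, using the closed-form least-squares solution:

```python
# Fit y ≈ w*x + b to a tiny made-up data set by minimizing
# the mean squared error loss: the "fit a function to data" problem.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]   # roughly y = 2x + 1, plus noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for a single feature
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# The minimized loss, for reference
loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
print(round(w, 2), round(b, 2))  # close to the true 2 and 1
```

Swap the straight line for a more expressive training function and the closed form for an iterative optimizer, and you have the skeleton of most of the models in that chapter.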

Chapter 4, “Optimization for Neural Networks”

Neural networks are modeled after the brain cortex, which involves millions of neurons arranged in a layered structure. The brain learns by reinforcing neuron connections when faced with a concept it has seen before, and weakening connections if it learns new information that undoes or contradicts previously learned concepts. Machines only understand numbers. Mathematically, stronger connections correspond to larger numbers (weights), and weaker connections correspond to smaller numbers. This chapter explains the optimization and backpropagation steps used when training neural networks, similar to how learning happens in our brain (not that humans fully understand this). It also walks through various regularization techniques, explaining their advantages, disadvantages, and use cases. Furthermore, we explain the intuition behind approximation theory and the universal approximation theorem for neural networks.
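
To see a connection weight being strengthened numerically, here is a one-weight toy version of that training loop (a sketch of my own with made-up numbers, not code from the chapter): gradient descent nudges the weight until the loss is minimized.

```python
# Gradient descent on a one-weight "network": find w minimizing
# loss(w) = (w*x - target)**2 for a single training example.
x, target = 3.0, 6.0    # the ideal weight is therefore 2.0
w = 0.0                 # start from a weak (zero) connection
learning_rate = 0.05

for _ in range(100):
    prediction = w * x
    gradient = 2 * (prediction - target) * x   # d(loss)/dw, by the chain rule
    w -= learning_rate * gradient              # strengthen or weaken the weight

print(round(w, 4))  # ≈ 2.0
```

Backpropagation is, at heart, this same chain-rule gradient computation repeated efficiently across millions of weights arranged in layers.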

Chapter 5, “Convolutional Neural Networks and Computer Vision”

Convolutional neural networks are widely popular for computer vision and natural language processing. In this chapter we start with the convolution and cross-correlation operations, then survey their uses in systems design and filtering signals and images. Then we integrate convolution with neural networks to extract higher-order features from images.
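
The cross-correlation operation itself is just a sliding dot product. This one-dimensional Python sketch (my own toy example, not the chapter's) slides a small kernel across a signal, which is exactly what a convolutional layer does to an image in two dimensions:

```python
# One-dimensional cross-correlation of a signal with a small kernel:
# the sliding-window dot product at the heart of convolutional layers.
signal = [0, 1, 3, 3, 1, 0]
kernel = [1, 0, -1]            # a simple difference (edge-detecting) filter

def cross_correlate(signal, kernel):
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

print(cross_correlate(signal, kernel))  # [-3, -2, 2, 3]
```

Note how the output flags where the signal rises and falls; learned kernels in a convolutional network pick out analogous features, such as edges, in images.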

Chapter 6, “Singular Value Decomposition: Image Processing, Natural Language Processing, and Social Media”

Diagonal matrices behave like scalar numbers and hence are highly desirable. Singular value decomposition is a crucially important method from linear algebra that transforms a dense matrix into a diagonal matrix. In the process, it reveals the action of a matrix on space itself: rotating and/or reflecting, stretching and/or squeezing. We can apply this simple process to any matrix of numbers. This wide applicability, along with the ability to dramatically reduce the dimensions while retaining essential information, make singular value decomposition popular in the fields of data science, AI, and machine learning. It is the essential mathematics behind principal component analysis and latent semantic analysis. This chapter walks through singular value decomposition along with its most relevant and up-to-date applications.
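
As a tiny preview (a sketch of my own using NumPy, not an example from the chapter), here is the decomposition of a small matrix, together with the rank-1 approximation obtained by keeping only the largest singular value:

```python
import numpy as np

# Singular value decomposition of a small matrix, and the rank-1
# approximation that keeps only the largest singular value.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Full reconstruction: U @ diag(s) @ Vt recovers A exactly
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Best rank-1 approximation of A, in the least-squares sense
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])

print(np.round(s, 3))  # singular values, largest first
```

Keeping the top few singular values of a much larger matrix is precisely the dimension-reduction trick behind principal component analysis and latent semantic analysis.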

Chapter 7, “Natural Language and Finance AI: Vectorization and Time Series”

We present the mathematics in this chapter in the context of natural language models, such as identifying topics, machine translation, and attention models. The main barrier to overcome is moving from words and sentences that carry meaning to low-dimensional vectors of numbers that a machine can process. We discuss state-of-the-art models such as Google Brain’s transformer (starting in 2017), among others, while keeping our attention only on the relevant math. Time series data and models (such as recurrent neural networks) appear naturally here. We briefly introduce finance AI, as it overlaps with natural language both in terms of modeling and how the two fields feed into each other.

Chapter 8, “Probabilistic Generative Models”

Machine-generated images, including those of humans, are becoming increasingly realistic. It is very hard nowadays to tell whether an image of a model in the fashion industry is that of a real person or a computer-generated image. We have generative adversarial networks (GANs) and other generative models to thank for this progress, where it is harder to draw a line between the virtual and the real. Generative adversarial networks are designed to repeat a simple mathematical process using two neural networks until the machine itself cannot tell the difference between a real image and a computer-generated one, hence the “very close to reality” success. Game theory and zero-sum games occur naturally here, as the two neural networks “compete” against each other. This chapter surveys generative models, which mimic imagination in the human mind. These models have a wide range of applications, from augmenting data sets to completing masked human faces to high energy physics, such as simulating data sets similar to those produced at the CERN Large Hadron Collider.
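The "simple mathematical process" that the two competing networks repeat is the minimax objective from the original GAN formulation (Goodfellow et al., 2014), where the discriminator $D$ and the generator $G$ play a zero-sum game:

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator pushes $V$ up by telling real samples $x$ from generated ones $G(z)$, while the generator pushes $V$ down until the discriminator can no longer tell the difference.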

Chapter 9, “Graph Models”

Graphs and networks are everywhere: cities and roadmaps, airports and connecting flights, the World Wide Web, the cloud (in computing), molecular networks, our nervous system, social networks, terrorist organization networks, even various machine learning models and artificial neural networks. Data that has a natural graph structure can be better understood by a mechanism that exploits and preserves that structure, building functions that operate directly on graphs, as opposed to embedding graph data into existing machine learning models that attempt to artificially reshape the data before analyzing it. This is the same reason convolutional neural networks are successful with image data, recurrent neural networks are successful with sequential data, and so on. The mathematics behind graph neural networks is a marriage among graph theory, computing, and neural networks. This chapter surveys this mathematics in the context of many applications.
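The idea of functions that "operate directly on graphs" can be sketched with one round of neighborhood aggregation, the basic operation inside graph neural networks; the 4-node graph and feature values below are made up for illustration:

```python
import numpy as np

# Adjacency matrix of a tiny undirected path graph: edges 0-1, 1-2, 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# One scalar feature per node.
x = np.array([1.0, 2.0, 3.0, 4.0])

# One aggregation step: every node averages its own feature with its
# neighbors' features. The graph structure is used directly rather than
# being flattened away before analysis.
A_hat = A + np.eye(4)          # add self-loops
degrees = A_hat.sum(axis=1)    # neighborhood sizes, including self
x_new = (A_hat @ x) / degrees  # mean over each neighborhood
```

Stacking such aggregation steps, with learned weights between them, is the structure-preserving mechanism the paragraph above contrasts with artificially reshaping graph data.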

Chapter 10, “Operations Research”

Another suitable name for operations research would be optimization for logistics. This chapter introduces the reader to problems at the intersection of AI and operations research, such as supply chain, traveling salesman, scheduling and staffing, queuing, and other problems whose defining features are high dimensionality, complexity, and the need to balance competing interests and limited resources. The math required to address these problems draws from optimization, game theory, duality, graph theory, dynamic programming, and algorithms.

Chapter 11, “Probability”

Probability theory provides a systematic way to quantify randomness and uncertainty. It generalizes logic to situations that are of paramount importance in artificial intelligence: when information and knowledge are uncertain. This chapter highlights the essential probability used in AI applications: Bayesian networks and causal modeling, paradoxes, large random matrices, stochastic processes, Markov chains, and reinforcement learning. It ends with rigorous probability theory, which demystifies measure theory and introduces interested readers to the universal approximation theorem for neural networks.
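As a small taste of the Markov chains covered in this chapter, the sketch below iterates a two-state transition matrix until the distribution stops changing; the transition probabilities are made up for illustration:

```python
import numpy as np

# Transition matrix of a two-state chain (say sunny/rainy weather):
# row = current state, column = next state; each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Starting from any distribution, repeated multiplication by P converges
# to the stationary distribution pi, which satisfies pi = pi P.
pi = np.array([1.0, 0.0])
for _ in range(100):
    pi = pi @ P
# pi is now (approximately) the stationary distribution [5/6, 1/6].
```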

Chapter 12, “Mathematical Logic”

This important topic is positioned near the end to not interrupt the book’s natural flow. Designing agents that are able to gather knowledge, reason logically about the environment within which they exist, and make inferences and good decisions based on this logical reasoning is at the heart of artificial intelligence. This chapter briefly surveys propositional logic, first-order logic, probabilistic logic, fuzzy logic, and temporal logic, within an intelligent knowledge-based agent.

Chapter 13, “Artificial Intelligence and Partial Differential Equations”

Differential equations model countless phenomena in the real world, from air turbulence to galaxies to the stock market to the behavior of materials and population growth. Realistic models are usually very hard to solve and require a tremendous amount of computational power when relying on traditional numerical techniques. AI has recently stepped in to accelerate solving differential equations. The first part of this chapter acts as a crash course on differential equations, highlighting the most important topics and arming the reader with a bird’s-eye view of the subject. The second part explores new AI-based methods that simplify the whole process of solving differential equations. These have the potential to unlock long-standing problems in the natural sciences, finance, and other fields.

Chapter 14, “Artificial Intelligence, Ethics, Mathematics, Law, and Policy”

I believe this chapter should be the first chapter in any book on artificial intelligence; however, this topic is so wide and deep that it needs multiple books to cover it completely. This chapter only scratches the surface and summarizes various ethical issues associated with artificial intelligence, including equity, fairness, bias, inclusivity, transparency, policy, regulation, privacy, weaponization, and security. It presents each problem along with possible solutions, whether mathematical or through policy and regulation.

My Favorite Books on AI

There are many excellent and incredibly insightful books on AI and on topics intimately related to the field. The following is not even close to being an exhaustive list. Some are technical books heavy on mathematics, and others are either introductory or completely nontechnical. Some are code-oriented (Python 3) and others are not. I have learned a lot from all of them:

  • Brunton, Steven L. and J. Nathan Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems and Control (Cambridge University Press, 2022)

  • Crawford, Kate, Atlas of AI (Yale University Press, 2021)

  • Ford, Martin, Architects of Intelligence (Packt Publishing, 2018)

  • Géron, Aurélien, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (O’Reilly, 2022)

  • Goodfellow, Ian, Yoshua Bengio, and Aaron Courville, Deep Learning (MIT Press, 2016)

  • Grus, Joel, Data Science from Scratch (O’Reilly, 2019)

  • Hawkins, Jeff, A Thousand Brains (Basic Books, 2021)

  • Izenman, Alan J., Modern Multivariate Statistical Techniques (Springer, 2013)

  • Jones, Herbert, Data Science: The Ultimate Guide to Data Analytics, Data Mining, Data Warehousing, Data Visualization, Regression Analysis, Database Querying, Big Data for Business and Machine Learning for Beginners (Bravex Publications, 2020)

  • Kleppmann, Martin, Designing Data-Intensive Applications (O’Reilly, 2017)

  • Lakshmanan, Valliappa, Sara Robinson, and Michael Munn, Machine Learning Design Patterns (O’Reilly, 2020)

  • Lane, Hobson, Hannes Hapke, and Cole Howard, Natural Language Processing in Action (Manning, 2019)

  • Lee, Kai-Fu, AI Superpowers (Houghton Mifflin Harcourt, 2018)

  • Macey, Tobias, ed., 97 Things Every Data Engineer Should Know (O’Reilly, 2021)

  • Marr, Bernard and Matt Ward, Artificial Intelligence in Practice (Wiley, 2019)

  • Moroney, Laurence, AI and Machine Learning for Coders (O’Reilly, 2021)

  • Mount, George, Advancing into Analytics: From Excel to Python and R (O’Reilly, 2021)

  • Norvig, Peter and Stuart Russell, Artificial Intelligence: A Modern Approach (Pearson, 2021)

  • Pearl, Judea, The Book of Why (Basic Books, 2020)

  • Planche, Benjamin and Eliot Andres, Hands-On Computer Vision with TensorFlow2 (Packt Publishing, 2019)

  • Potters, Marc, and Jean-Philippe Bouchaud, A First Course in Random Matrix Theory for Physicists, Engineers, and Data Scientists (Cambridge University Press, 2020)

  • Rosenthal, Jeffrey S., A First Look at Rigorous Probability Theory (World Scientific Publishing, 2016)

  • Roshak, Michael, Artificial Intelligence for IoT Cookbook (Packt Publishing, 2021)

  • Strang, Gilbert, Linear Algebra and Learning from Data (Wellesley Cambridge Press, 2019)

  • Stone, James V., Artificial Intelligence Engines (Sebtel Press, 2020)

  • Stone, James V., Bayes’ Rule, A Tutorial Introduction to Bayesian Analysis (Sebtel Press, 2013)

  • Stone, James V., Information Theory: A Tutorial Introduction (Sebtel Press, 2015)

  • Vajjala, Sowmya et al., Practical Natural Language Processing (O’Reilly, 2020)

  • Van der Hofstad, Remco, Random Graphs and Complex Networks (Cambridge, 2017)

  • Vershynin, Roman, High-Dimensional Probability: An Introduction with Applications in Data Science (Cambridge University Press, 2018)

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

The very few code examples that we have in this book are available for download at https://github.com/halanelson/Essential-Math-For-AI.

If you have a technical question or a problem using the code examples, please send an email to .

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Essential Math for AI by Hala Nelson (O’Reilly). Copyright 2023 Hala Nelson, 978-1-098-10763-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/essentialMathAI.

Email to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Follow us on Twitter: https://twitter.com/oreillymedia

Watch us on YouTube: https://www.youtube.com/oreillymedia

Acknowledgments

My dad, Yousef Zein, who taught me math, and made sure to always remind me: Don’t think that the best thing we gave you in this life is land, or money. These come and go. Humans create money, buy assets, and create more money. What we did give you is a brain, a really good brain. That’s your real asset, so go out and use it. I love your brain. This book is for you, Dad.

My mom, Samira Hamdan, who taught me both English and philosophy, and who gave up everything to make sure we were happy and successful. I wrote this book in English, not my native language, thanks to you, mom.

My daughter, Sary, who kept me alive during the most vulnerable times, and who is the joy of my life.

My husband, Keith, who gives me the love, passion, and stability that allow me to be myself, and to do so many things, some of them unwise, like writing a five-hundred-or-so-page book on math and AI. I love you.

My sister, Rasha, who is my soulmate. This says it all.

My brother, Haitham, who went against all our cultural norms and traditions to support me.

The memory of my uncle Omar Zein, who also taught me philosophy, and who made me fall in love with the mysteries of the human mind.

My friends Sharon and Jamie, who let me write massive portions of this book at their house, and were great editors any time I asked.

My lifetime friend Oren, who on top of being one of the best friends anyone can wish for, agreed to read and review this book.

My friend Huan Nguyen, whose story should be its own book, and who also took the time to read and review this book. Thank you, admiral.

My friend and colleague John Webb, who read every chapter word by word, and provided his invaluable pure math perspective.

My wonderful friends Deb, Pankaj, Jamie, Tamar, Sajida, Jamila, Jen, Mattias, and Karen, who are part of my family. I love life with you.

My mentors Robert Kohn (New York University) and John Schotland (Yale University), to whom I owe reaching many milestones in my career. I learned a great deal from you.

The memory of Peter, whose impact was monumental, and who will forever inspire me.

The reviewers of this book, who took time and care despite their busy schedules to make it much better. Thank you for your great expertise and for generously giving me your unique perspectives from all your different domains.

All the waiters and waitresses in many cities in the world, who tolerated me sitting at my laptop at their restaurants for hours and hours and hours, writing this book. I got so much energy and happiness from you.

My incredible, patient, cheerful, and always supportive editors, Angela Rufino and Kristen Brown.

Chapter 1. Why Learn the Mathematics of AI?

It is not until someone said, “It is intelligent,” that I stopped searching, and paid attention.

H.

Artificial intelligence, known as AI, is here. It has penetrated multiple aspects of our lives and is increasingly involved in making very important decisions. Soon it will be employed in every sector of our society, powering most of our daily operations. The technology is advancing very fast, and investment in it is skyrocketing. At the same time, it feels like we are in the middle of an AI frenzy. Every day we hear about a new AI accomplishment. AI beats the best human player at a Go game. AI outperforms human vision in classification tasks. AI makes deep fakes. AI generates high energy physics data. AI solves difficult partial differential equations that model the natural phenomena of the world. Self-driving cars are on the roads. Delivery drones are hovering in some parts of the world.

We also hear about AI’s seemingly unlimited potential. AI will revolutionize healthcare and education. AI will eliminate global hunger. AI will fight climate change. AI will save endangered species. AI will battle disease. AI will optimize the supply chain. AI will unravel the origins of life. AI will map the observable universe. Our cities and homes will be smart. Eventually, we cross into science fiction territory. Humans will upload their brains into computers. Humans will be enhanced by AI. Finally, the voices of fear and skepticism emerge: AI will take over and destroy humanity.

Amid this frenzy, where the lines between reality, speculation, exaggeration, aspiration, and pure fiction are blurred, we must first define AI, at least within the context of this book. We will then discuss some of its limitations, where it is headed, and set the stage for the mathematics that is used in today’s AI. My hope is that when you understand the mathematics, you will be able to look at the subject from a relatively deep perspective, and the blurring lines between fiction, reality, and everything in between will become more clear. You will also learn the main ideas behind state-of-the-art math in AI, arming you with the confidence needed to use, improve, or even create entirely new AI systems.

What Is AI?

I have yet to come across a unified definition of AI. If we ask two AI experts, we hear two different answers. Even if we ask the same expert on two different days, they might come up with two different definitions. The reason for this inconsistency and seeming inability to define AI is that until now it has not been clear what the definition of the I is. What is intelligence? What makes us human and unique? What makes us conscious of our own existence? How do neurons in our brain aggregate tiny electric impulses and translate them into images, sounds, feelings, and thoughts? These are vast topics that have fascinated philosophers, anthropologists, and neuroscientists for centuries. I will not attempt to go there in this book. I will, however, address artificial intelligence in terms of an AI agent and list the following defining principles for the purposes of this book. In 2022, an AI agent can be one or more of the following:

  • An AI agent can be pure software or have a physical robotic body.

  • An AI agent can be geared toward a specific task, or be a flexible agent exploring and manipulating its environment, building knowledge with or without a specific aim.

  • An AI agent learns with experience, that is, it gets better at performing a task with more practice at that task.

  • An AI agent perceives its environment, then builds, updates, and/or evolves a model for this environment.

  • An AI agent perceives, models, analyzes, and makes decisions that lead to accomplishing its goal. This goal can be predefined and fixed, or variable and changing with more input.

  • An AI agent understands cause and effect, and can tell the difference between patterns and causes.

Whenever a mathematical model for AI is inspired by the way our brain works, I will point out the analogy, hence keeping AI and human intelligence in comparison, without having to define either. Even though today’s AI is nowhere close to human intelligence, except for specific tasks such as image classification, AlphaGo, etc., so many human brains have recently converged to develop AI that the field is bound to grow and have breakthroughs in the coming years.

It is also important to note that some people use the terms artificial intelligence, machine learning, and data science interchangeably. These three domains overlap but they are not the same. The fourth very important but slightly less hyped area is that of robotics, where physical parts and motor skills must be integrated into the learning and reasoning processes, merging mechanical engineering, electrical engineering, and bioengineering with information and computer engineering. One fast way to think about the interconnectivity of these fields is: data fuels machine learning algorithms that in turn power many popular AI and/or robotics systems. The mathematics in this book is useful, in different proportions, for all four domains.

Why Is AI So Popular Now?

In the past decade, AI has sprung into worldwide attention due to the successful combination of the following factors:

Generation and digitization of massive amounts of data

This may include text data, images, videos, health records, e-commerce, network, and sensor data. Social media and the Internet of Things have played a very significant role here with their continuous streaming of great volumes of data.

Advances in computational power

This occurs through parallel and distributed computing as well as innovations in hardware, allowing for efficient and relatively cheap processing of large volumes of complex structured and unstructured data.

Recent success of neural networks in making sense of big data

AI has surpassed human performance in certain tasks such as image recognition and the Go game. When AlexNet won the ImageNet Large Scale Visual Recognition Challenge in 2012, it spurred a myriad of activity in convolutional neural networks (supported by graphical processing units), and in 2015, PReLU-Net (ResNet) was the first to outperform humans in image classification.

When we examine these factors, we realize that today’s AI is not the same as science fiction AI. Today’s AI is centered around big data (all kinds of data), machine learning algorithms, and is heavily geared toward performing one task extremely well, as opposed to developing and adapting varied intelligence types and goals as a response to the surrounding environment.

What Is AI Able to Do?

There are many more areas and industries where AI can be successfully applied than there are AI experts who are well suited to respond to this ever-growing need. Humans have always strived for automating processes, and AI carries a great promise to do exactly that, at a massive scale. Large and small companies have volumes of raw data that they would like to analyze and turn into insights for profits, optimal strategies, and allocation of resources. The health industry suffers a severe shortage of doctors, and AI has innumerable applications and unlimited potential there. Worldwide financial systems, stock markets, and banking industries have always depended heavily on our ability to make good predictions, and have suffered tremendously when those predictions failed. Scientific research has progressed significantly with our increasing ability to compute, and today we are at a new dawn where advances in AI enable computations at scales thought impossible a few decades ago.

Efficient systems and operations are needed everywhere, from the power grid, transportation, and the supply chain to forest and wildlife preservation, battling world hunger, disease, and climate change. Automation is even sought after in AI itself, where an AI system spontaneously decides on the optimal pipelines, algorithms, and parameters, readily producing the desired outcomes for given tasks, thus eliminating the need for human supervision altogether.

An AI Agent’s Specific Tasks

In this book, as I work through the math, I will focus on popular application areas of AI, in the context of an AI agent’s specified tasks. Nevertheless, the beneficial mathematical ideas and techniques are readily transferable across different application domains. The reason for this seeming easiness and wide applicability is that we happen to be at the age of AI implementation, in the sense that the main ideas for addressing certain tasks have already been developed, and with only a little tweaking, they can be implemented across various industries and domains. Our AI topics and/or tasks include:

Simulated and real data

Our AI agent processes data, provides insights, and makes decisions based on that data (using mathematics and algorithms).

The brain neocortex

Neural networks in AI are modeled after the neocortex, or the new brain. This is the part of our brain responsible for high functions such as perception, memory, abstract thought, language, voluntary physical action, decision making, imagination, and consciousness. The neocortex has many layers, six of which are mostly distinguishable. It is flexible and has a tremendous learning ability. The old brain and the reptilian brain lie below the neocortex, and are responsible for emotions and more basic and primitive survival functions such as breathing, regulating the heartbeat, fear, aggression, sexual urges, and others. The old brain keeps records of actions and experiences that lead to favorable or unfavorable feelings, creating our emotional memory that influences our behavior and future actions. Our AI agent, in a very basic way, emulates the neocortex and sometimes the old brain.

Computer vision

Our AI agent senses and recognizes its environment through cameras, sensors, etc. It peeks into everything, from our daily pictures and videos, to our MRI scans, and all the way into images of distant galaxies.

Natural language processing

Our AI agent communicates with its environment and automates tedious and time-consuming tasks such as text summarization, language translation, sentiment analysis, document classification and ranking, captioning images, and chatting with users.

Financial systems

Our AI agent detects fraud in our daily transactions, assesses loan risks, and provides 24-hour feedback and insights about our financial habits.

Networks and graphs

Our AI agent processes network and graph data, such as animal social networks, infrastructure networks, professional collaboration networks, economic networks, transportation networks, biological networks, and many others.

Social media

Our AI agent has social media to thank for providing the large amount of data necessary for its learning. In return, our AI agent attempts to characterize social media users, identifying their patterns, behaviors, and active networks.

The supply chain

Our AI agent is an optimizing expert. It helps us predict optimal resource needs and allocation strategies at each level of the production chain. It also finds ways to end world hunger.

Scheduling and staffing

Our AI agent facilitates our daily operations.

Weather forecasting

Our AI agent solves partial differential equations used in weather forecasting and prediction.

Climate change

Our AI agent attempts to fight climate change.

Education

Our AI agent delivers personalized learning experiences.

Ethics

Our AI agent strives to be fair, equitable, inclusive, transparent, unbiased, and protective of data security and privacy.

What Are AI’s Limitations?

Along with the impressive accomplishments of AI and its great promise to enhance or revolutionize entire industries, there are some real limitations that the field needs to overcome. Some of the most pressing limitations are:

Intelligence

Current AI is not even remotely close to being intelligent in the sense that we humans consider ourselves uniquely intelligent. Even though AI has outperformed humans in innumerable tasks, it cannot naturally switch and adapt to new tasks. For example, an AI system trained to recognize humans in images cannot recognize cats without retraining, or generate text without changing its architecture and algorithms. In the context of the three types of AI, we have thus far only partially accomplished artificial narrow intelligence, which has a narrow range of abilities. We have accomplished neither artificial general intelligence, on par with human abilities, nor artificial super intelligence, which is more capable than humans’. Moreover, machines today are incapable of experiencing any of the beautiful human emotions, such as love, closeness, happiness, pride, dignity, caring, sadness, loss, and many others. Mimicking emotions is different than experiencing and genuinely providing them. In this sense, machines are nowhere close to replacing humans.

Large volumes of labeled data

Most popular AI applications need large volumes of labeled data. For example, MRI images can be labeled cancer or not-cancer, YouTube videos can be labeled safe for children or unsafe, and house prices can come with the house district, number of bedrooms, median family income, and other features; in this case, the house price is the label. The limitation is that the data required to train a system is usually not readily available, nor cheap to obtain, label, maintain, or warehouse. A substantial amount of data is confidential, unorganized, unstructured, biased, incomplete, and unlabeled. Obtaining the data, curating it, preprocessing it, and labeling it become major obstacles requiring large investments of time and resources.
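
The feature-and-label structure described above can be made concrete with a tiny, entirely hypothetical data set in Python (the districts, incomes, and prices below are invented for illustration):

```python
# A tiny hypothetical labeled data set in the spirit of the house-price
# example above: each row carries a feature vector, and the label (here,
# the price) is the value a supervised model learns to predict.

houses = [
    # (district, bedrooms, median_family_income)  ->  label: price
    {"features": ("riverside", 3, 72_000), "label": 310_000},
    {"features": ("hilltop",   2, 58_000), "label": 215_000},
    {"features": ("downtown",  1, 64_000), "label": 180_000},
]

# Supervised learning consumes the (features, label) pairs together;
# obtaining the labels is often the expensive, human-intensive step.
for row in houses:
    print(row["features"], "->", row["label"])
```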

Multiple methods and hyperparameters

For a certain AI task, there are sometimes many methods, or algorithms, to accomplish it. Each task, data set, and/or algorithm has parameters, called hyperparameters, that can be tuned during implementation, and it is not always clear what the best values for these hyperparameters are. The variety of methods and hyperparameters available to tackle a specific AI task mean that different methods can produce extremely different results, and it is up to humans to assess which methods’ decisions to rely on. In some applications, such as which dress styles to recommend for a certain customer, these discrepancies may be inconsequential. In other areas, AI-based decisions can be life-changing: a patient is told they do not have a certain disease, while in fact they do; an inmate is mislabeled as highly likely to reoffend and gets their parole denied as a consequence; or a loan gets rejected for a qualified person. Research is ongoing on how to address these issues, and I will expand on them as we progress through the book.
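
To see how much a single hyperparameter can matter, here is a minimal, self-contained sketch of a k-nearest-neighbors classifier on a made-up one-dimensional data set; changing k alone flips the prediction for the same query point (all numbers and labels below are invented for illustration):

```python
def knn_predict(points, labels, query, k):
    """Majority-vote label among the k training points closest to query."""
    # Rank training points by distance to the query (1-D for simplicity).
    ranked = sorted(zip(points, labels), key=lambda pair: abs(pair[0] - query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Invented one-dimensional training data: two clusters, two labels.
points = [1.0, 2.0, 3.0, 10.0, 11.0]
labels = ["A", "A", "B", "B", "B"]

query = 3.5
print(knn_predict(points, labels, query, k=1))  # nearest point (3.0) votes "B"
print(knn_predict(points, labels, query, k=3))  # nearest three (3.0, 2.0, 1.0) vote "A"
```

The same data and the same algorithm, yet k = 1 and k = 3 disagree; which answer to trust is exactly the human judgment call the paragraph above describes.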

Resource limitations

Human abilities and potential are limited by our brainpower, the capacity of our biological bodies, and the resources available on Earth and in the universe that we are able to manipulate. AI systems are similarly limited by the computing power and hardware capability of the systems supporting the AI software. Recent studies have suggested that computation-intensive deep learning is approaching its computational limits, and that new ideas are needed to improve algorithm and hardware efficiency, or to discover entirely new methods. Progress in AI has depended heavily on large increases in computing power. This power, however, is not unlimited; it is extremely costly for large systems processing massive data sets, and it has a substantial carbon footprint that cannot be ignored: consider the power required to run and cool data warehouses, power individual devices, keep the cloud connected, and so on. Moreover, data and algorithmic software do not exist in a vacuum. Devices such as computers, phones, tablets, and batteries, along with the warehouses and systems needed to store, transfer, and process data and algorithms, are made of real physical materials harvested from Earth. It took Earth millions of years to make some of these materials, and the kind of infinite supply required to sustain these technologies forever is just not there.

Security costs

Security, privacy, and adversarial attacks remain primary concerns for AI, especially with the advent of interconnected systems. A lot of research and resources are being allocated to address these important issues. Since most current AI is software and most data is digital, the arms race in this area is never-ending. This means that AI systems need to be constantly monitored and updated, which requires hiring expensive AI and cybersecurity specialists, possibly at a cost that defeats the initial purpose of automation at scale.

Broader impacts

The AI research and implementation industries have thus far viewed themselves as slightly separate from the economic, social, and security consequences of their advancing technologies. Usually these ethical, social, and security implications of AI work are acknowledged as important and in need of attention, but beyond the scope of the work itself. As AI becomes widely deployed and its impacts on the fabric and nature of society, on markets, and on potential threats are felt more strongly, the field as a whole has to become more intentional in the way it attends to these issues of paramount importance. In this sense, the AI development community has been limited in the resources it allocates to addressing the broader impacts of implementing and deploying its new technologies.

What Happens When AI Systems Fail?

A very important part of learning about AI is learning about its incidents and failures. This helps us foresee and avoid similar outcomes when designing our own AI, before deploying it into the real world. If the AI fails after being deployed, the consequences can be extremely undesirable, dangerous, or even lethal.

One online repository for AI failures, called the AI Incident Database, contains more than a thousand such incidents. Examples from this website include:

  • A self-driving car kills a pedestrian.

  • Self-driving cars lose contact with their company’s server for a full 20 minutes and all stall at once in the streets of San Francisco (June 28 and May 18 of 2022).

  • A trading algorithm causes a market flash crash where billions of dollars automatically transfer between parties.

  • A facial recognition system causes an innocent person to be arrested.

  • Microsoft’s infamous chatbot Tay is shut down only 16 hours after its release, since it quickly learned and tweeted offensive, racist, and highly inflammatory remarks.

Such bad outcomes can be mitigated but require a deep understanding of how these systems work, at all levels of production, as well as of the environment and users they are deployed for. Understanding the mathematics behind AI is one crucial step in this discerning process.

Where Is AI Headed?

To be able to answer, or speculate on, where AI is headed, it is best to recall the field’s original goal since its inception: mimic human intelligence. This field was conceived in the fifties. Examining its journey over the past seventy years might tell us something about its future direction. Moreover, studying the history of the field and its trends enables us to have a bird’s-eye view of AI, putting everything in context and providing a better perspective. This also makes learning the mathematics involved in AI a less overwhelming experience. The following is a very brief and nontechnical overview of AI’s evolution and its eventual thrust into the limelight thanks to the recent impressive progress of deep learning.

In the beginning, AI research attempted to mimic intelligence using rules and logic. The idea was that all we needed to do was feed machines facts and logical rules for reasoning about those facts (we will see examples of this logical structure in Chapter 12). There was no emphasis on the learning process. The challenge was that capturing human knowledge requires too many rules and constraints to be tractable for a coder, and the approach seemed infeasible.

In the late 1990s and the early 2000s, various machine learning methods became popular. Instead of programming the rules, and making conclusions and decisions based on these preprogrammed rules, machine learning infers the rules from the data. The more data a machine learning system is able to handle and process, the better its performance. Data and the ability to process and learn from large amounts of data economically and efficiently became the main goals. Popular machine learning algorithms in that time period were support vector machines, Bayesian networks, evolutionary algorithms, decision trees, random forests, regression, logistic regression, and others. These algorithms are still popular now.

After 2010, and particularly in 2012, a tidal wave of neural networks and deep learning took over after the success of AlexNet’s convolutional neural network in image recognition.

Most recently, in the last five years, reinforcement learning gained popularity after DeepMind’s AlphaGo beat the world champion in the very complicated ancient Chinese game of Go.

Note that this glimpse of history is very rough: regression has been around since Legendre and Gauss in the very early 1800s, and the first artificial neurons and neural networks were formulated in the late 1940s and early 1950s with the works of neurophysiologist Warren McCulloch, mathematician Walter Pitts, and psychologists Donald Hebb and Frank Rosenblatt. The Turing Test, originally called the Imitation Game, was introduced in 1950 by Alan Turing, a computer scientist, cryptanalyst, mathematician, and theoretical biologist, in his paper “Computing Machinery and Intelligence”. Turing proposed that a machine possesses artificial intelligence if its responses are indistinguishable from those of a human. Thus, a machine is considered intelligent if it is able to imitate human responses. The Turing Test, however, for a person outside the field of computer science, sounds limiting in its definition of intelligence, and I wonder if the Turing Test might have inadvertently limited the goals or the direction of AI research.

Even though machines are able to mimic human intelligence in some specific tasks, the original goal of replicating human intelligence has not been accomplished yet, so it might be safe to assume that is where the field is headed, even though it could involve rediscovering old ideas or inventing entirely new ones. The current level of investment in the area, combined with the explosion in research and public interest, are bound to produce new breakthroughs. Nonetheless, breakthroughs brought about by recent AI advancements are already revolutionizing entire industries eager to implement these technologies. These contemporary AI advancements involve plenty of important mathematics that we will be exploring throughout this book.

Who Are the Current Main Contributors to the AI Field?

The main AI race has been between the United States, Europe, and China. Some of the world leaders in the technology industry have been Google and its parent company Alphabet, Amazon, Facebook, Microsoft, Nvidia, and IBM in the United States, DeepMind in the UK and the United States (owned by Alphabet), and Baidu and Tencent in China. There are major contributors from the academic world as well, but these are too many to enumerate. If you are new to the field, it is good to know the names of the big players, their histories and contributions, and the kinds of goals they are currently pursuing. It is also valuable to learn about the controversies, if any, surrounding their work. This general knowledge comes in handy as you navigate through and gain more experience in AI.

What Math Is Typically Involved in AI?

When I say the word “math,” what topics and subjects come to your mind?

Whether you are a math expert or a beginner, whatever math topic that you thought of to answer the question is most likely involved in AI. Here is a commonly used list of the most useful math subjects for AI implementation: calculus, linear algebra, optimization, probability, and statistics; however, you do not need to be an expert in all of these fields to succeed in AI. What you do need is a deep understanding of certain useful topics drawn from these math subjects. Depending on your specific application area, you might need special topics from: random matrix theory, graph theory, game theory, differential equations, and operations research.

In this book we will walk through these topics without presenting a textbook on each one. AI application and implementation are the unifying themes for these varied and intimately interacting mathematical subjects. Using this approach, I might offend some math experts by simplifying a lot of technical definitions or omitting whole theorems and delicate details, and I might as well offend AI or specialized industry experts, again omitting details involved in certain applications and implementations. The goal, however, is to keep the book simple and readable, while at the same time covering most of the math topics that are important for AI applications. Interested readers who want to dive deeper into the math or the AI field can then read more involved books on the particular area they want to focus on. My hope is that this book is a concise summary and a thorough overview, hence a reader can afterward branch out confidently to whatever AI math field or AI application area interests them.

Summary and Looking Ahead

Human intelligence reveals itself in perception, vision, communication through natural language, reasoning, decision making, collaboration, empathy, modeling and manipulating the surrounding environment, transfer of skills and knowledge across populations and generations, and generalization of innate and learned skills into new and uncharted domains. Artificial intelligence aspires to replicate all aspects of human intelligence. In its current state, AI addresses only one or a few aspects of intelligence at a time. Even with this limitation, AI has been able to accomplish impressive feats, such as modeling protein folding and predicting protein structures, which are the building blocks of life. The implications of this one AI application (among many) for understanding the nature of life and battling all kinds of diseases are boundless.

When you enter the AI field, it is important to remain mindful of which aspect of intelligence you are developing or using. Is it perception? Vision? Natural language? Navigation? Control? Reasoning? Which mathematics to focus on and why then follow naturally, since you already know where in the AI field you are situated. It will then be easy to attend to the mathematical methods and tools used by the community developing that particular aspect of AI. The recipe in this book is similar: first the AI type and application, then the math.

In this chapter, we addressed general questions. What is AI? What is AI able to do? What are AI’s limitations? Where is AI headed? How does AI work? We also briefly surveyed important AI applications, the problems usually encountered by companies trying to integrate AI into their systems, incidents that happen when systems are not well implemented, and the math subjects typically needed for AI implementations.

In the next chapter, we dive into data and affirm its intimate relationship to AI. When we talk data, we also talk data distributions, and that plunges us straight into probability theory and statistics.

Chapter 2. Data, Data, Data

Maybe if I know where it all came from, and why, I would know where it’s all headed, and why.

H.

Data is the fuel that powers most AI systems. In this chapter, we will understand how data, and devising methods for extracting useful and actionable information from data, is at the heart of perception AI.

Perception AI is based on statistical learning from data, where an AI agent, or a machine, perceives data from its environment, then detects patterns within this data, allowing it to draw conclusions and/or make decisions.

Perception AI is different from the three other types of AI:

Understanding AI

Where an AI system understands that the image it classified as a chair serves the function of sitting, the image it classified as cancer means that the person is sick and needs further medical attention, or the textbook it read about linear algebra can be used to extract useful information from data.

Control AI

This has to do with controlling the physical parts of the AI agent in order to navigate spaces, open doors, serve coffee, etc. Robotics has made significant progress in this area. We need to augment robots with “brains” that include perception AI and understanding AI, and connect those to the control AI. Ideally, like humans, control AI then learns from its physical interactions with its environment by passing that information to its perception and understanding systems, which in turn pass control commands to the agent’s control systems.

Awareness AI

Where an AI agent has an inner experience similar to the human experience. Since we do not know yet how to mathematically define awareness, we do not visit this concept at all in this book.

Ideally, true human-like intelligence combines all four aspects: perception, understanding, control, and awareness. The main focus of this chapter and the next few chapters is perception AI. Recall that AI and data have become intertwined to the extent that it is now common, though erroneous, to use the terms data science and AI synonymously.

Data for AI

At the core of many popular machine learning models, including the highly successful neural networks that brought artificial intelligence back into the popular spotlight with AlexNet in 2012, lies a very simple mathematical problem:

Fit a given set of data points into an appropriate function (mapping an input to an output) that picks up on the important signals in the data and ignores the noise, then make sure this function performs well on new data.
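
As a toy illustration of this fitting problem, the sketch below generates hypothetical data from a known line plus noise, then recovers the signal with a closed-form least-squares fit. The true function is known here only because we are the ones simulating the data; in practice it is unknown:

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: the signal is y = 2x + 1, perturbed by small noise.
xs = [0, 1, 2, 3, 4]
noise = [0.1, -0.2, 0.05, 0.15, -0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

a, b = fit_line(xs, ys)
print(a, b)        # close to the true slope 2 and intercept 1
print(a * 10 + b)  # the fitted function predicts on new, unseen input x = 10
```

The fit picks up the underlying line and largely ignores the noise, and the last line is the second half of the stated problem: using the fitted function on data it has never seen.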

Complexity and challenges, however, arise from various sources:

Hypothesis and features

Neither the true function that generated the data nor all the features it actually depends on are known. We simply observe the data, then try to estimate a hypothetical function that generated it. Our function tries to learn which features of the data are important for our predictions, classifications, decisions, or general purposes. It also learns how these features interact in order to produce the observed results. One of the great potentials of AI in this context is its ability to pick up on subtle interactions between features of data that humans do not usually pick up on, since we are very good at observing strong features but may miss more subtle ones. For example, we as humans can tell that a person’s monthly income affects their ability to pay back a loan, but we might not observe that their daily commute, or morning routine, may have a nontrivial effect on that as well. Some feature interactions are much simpler than others, such as linear interactions. Others are more complex and are nonlinear. From a mathematical point of view, whether our feature interactions are simple (linear) or complex (nonlinear), we still have the same goal: find the hypothetical function that fits your data and is able to make good predictions on new data. One extra complication arises here: there are many hypothetical functions that can fit the same data set—how do we know which ones to choose?

Performance

Even after computing a hypothetical function that fits our data, how do we know whether it will perform well on new and unseen data? How do we know which performance measure to choose, and how to monitor this performance after deploying into the real world? Real-world data and scenarios do not come to us all labeled with ground truths, so we cannot easily measure whether our AI system is doing well and making correct or appropriate predictions and decisions. We do not know what to measure the AI system’s results against. If real-world data and scenarios were labeled with ground truths, then we would all be out of business since we would know what to do in every situation, there would be peace on Earth, and we would live happily ever after (not really, I wish it was that simple).
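
One way to make this concern concrete is to compare a model's performance on the data it was fit to with its performance on held-out data. The sketch below uses a deliberately trivial "majority label" baseline on an invented data set:

```python
def train_majority_model(labels):
    """'Learn' the single most common label: a baseline classifier."""
    return max(set(labels), key=labels.count)

def accuracy(model_label, labels):
    """Fraction of labels the constant prediction gets right."""
    return sum(1 for y in labels if y == model_label) / len(labels)

# Invented labels, split into a training portion and a held-out portion.
data = ["spam", "spam", "spam", "ham", "spam", "ham", "ham", "ham"]
train, test = data[:6], data[6:]

model = train_majority_model(train)  # predicts "spam" for everything
print(accuracy(model, train))        # looks decent on the training data
print(accuracy(model, test))         # collapses on the unseen data
```

The gap between the two numbers is the whole point: a performance measure computed only on the data the model has already seen can be badly misleading.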

Volume

Almost everything in the AI field is very high-dimensional! The number of data instances, observed features, and unknown weights to be computed could be in the millions, and the required computation steps in the billions. Efficient storage, transport, exploration, preprocessing, structuring, and computation on such volumes of data become center goals. In addition, exploring the landscapes of the involved high-dimensional mathematical functions is a nontrivial endeavor.

Structure

Most of the data created by the modern world is unstructured. It is not organized in easy-to-query tables containing labeled fields such as names, phone numbers, genders, ages, zip codes, house prices, income levels, etc. Unstructured data is everywhere: posts on social media, user activity, Word documents, PDF files, images, audio and video files, collaboration software data, traffic, seismic, or weather data, GPS, military movement, emails, instant messages, mobile chat data, and many others. Some of these examples, such as email data, can be considered semistructured, since emails come with headings that include the email’s metadata: From, To, Date, Time, Subject, Content-Type, Spam Status, etc. Moreover, large volumes of important data are not available in digital format and are fragmented across multiple, noncommunicating databases. Examples here include historical military data, museum archives, and hospital records. Presently, there is great momentum toward digitalizing our world and our cities to leverage more AI applications. Overall, it is easier to draw insights from structured and labeled data than from unstructured data. Mining unstructured data requires innovative techniques, which are currently driving forces in the fields of data science, machine learning, and artificial intelligence.
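
The email example can be sketched with Python's standard-library parser: the headers are queryable, labeled fields (the structured part), while the body is free text (the unstructured part). The message below is made up for illustration:

```python
from email import message_from_string

# A made-up raw email: headers above the blank line, body below it.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the unstructured part of the message lives here.\n"
)

msg = message_from_string(raw)
print(msg["From"])        # structured: a labeled, directly queryable field
print(msg["Subject"])     # structured: another labeled field
print(msg.get_payload())  # unstructured: free text that needs mining
```

Querying `msg["From"]` is trivial; extracting meaning from the body is where the innovative techniques mentioned above come in.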

Real Data Versus Simulated Data


When we work with data, it is very important to know the difference between real data and simulated data. Both types of data are extremely valuable for human discovery and progress.

Real data


This data is collected through real-world observations, using measuring devices, sensors, surveys, structured forms like medical questionnaires, telescopes, imaging devices, websites, stock markets, controlled experiments, etc. This data is often imperfect and noisy due to inaccuracies and failures in measuring methods and instruments. Mathematically, we do not know the exact function or probability distribution that generated the real data, but we can hypothesize about it using models, theories, and simulations. We can then test our models, and finally use them to make predictions.

Simulated data


This is data generated using a known function or randomly sampled from a known probability distribution. Here, we have our known mathematical function(s), or model, and we plug numerical values into the model to generate our data points. Examples are plentiful: numerical solutions of partial differential equations modeling all kinds of natural phenomena on all kinds of scales, such as turbulent flows, protein folding, heat diffusion, chemical reactions, planetary motion, fractured materials, traffic, and even enhancing Disney movie animations, such as simulating natural water movement in Moana or Elsa’s hair movement in Frozen.


In this chapter, we present two examples about human height and weight data to demonstrate the difference between real and simulated data. In the first example, we visit an online public database, then download and explore two real data sets containing measurements of the heights and weights of real individuals. In the second example, we simulate our own data set of heights and weights based on a function that we hypothesize: we assume that the weight of an individual depends linearly on their height. This means that when we plot the weight data against the height data, we expect to see a straight, or flat, visual pattern.

Mathematical Models: Linear Versus Nonlinear


Linear dependencies model flatness in the world, like one-dimensional straight lines, two-dimensional flat surfaces (called planes), and higher-dimensional hyperplanes. The graph of a linear function, which models a linear dependency, is forever flat and does not bend. Every time you see a flat object, like a table, a rod, a ceiling, or a bunch of data points huddled together around a straight line or a flat surface, know that their representative function is linear. Anything that isn’t flat is nonlinear, so functions whose graphs bend are nonlinear, and data points that congregate around bending curves or surfaces are generated by nonlinear functions.


The formula for a linear function, representing a linear dependency of the function output on the features, or variables, is very easy to write down. The features appear in the formula as just themselves, with no powers or roots, and are not embedded in any other functions, such as denominators of fractions, sine, cosine, exponential, logarithmic, or other calculus functions. They can only be multiplied by scalars (real or complex numbers, not vectors or matrices), and added to or subtracted from each other. For example, a function that depends linearly on three features x₁, x₂, and x₃ can be written as:

f(x₁, x₂, x₃) = ω₀ + ω₁x₁ + ω₂x₂ + ω₃x₃


where the parameters ω₀, ω₁, ω₂, and ω₃ are scalar numbers. The parameters or weights ω₁, ω₂, and ω₃ linearly combine the features, and produce the outcome of f(x₁, x₂, x₃) after adding the bias term ω₀. In other words, the outcome is produced as a result of linear interactions between the features x₁, x₂, and x₃, plus bias.
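
In code, this linear formula is nothing but a bias plus a dot product of the weights with the features. A minimal sketch in plain Python, where the numerical weights are arbitrary values chosen only for illustration:

```python
# Linear function of three features: f(x1, x2, x3) = w0 + w1*x1 + w2*x2 + w3*x3.
# The weight values used in the example call below are arbitrary, not from the text.
def linear_f(x, w0, w):
    # bias w0 plus the linear combination (dot product) of weights w and features x
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

# f(1, 2, 3) with bias 1 and weights (2, 3, 4): 1 + 2*1 + 3*2 + 4*3 = 21
print(linear_f((1, 2, 3), 1, (2, 3, 4)))  # 21
```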


The formula for a nonlinear function, representing a nonlinear dependency of the function output on the features, is very easy to spot as well. One or more features appear in the function formula with a power other than one, or multiplied or divided by other features, or embedded in some other calculus functions, such as sines, cosines, exponentials, logarithms, etc. The following are three examples of functions depending nonlinearly on the three features x₁, x₂, and x₃:

f(x₁, x₂, x₃) = ω₀ + ω₁x₁ + ω₂x₂x₃
f(x₁, x₂, x₃) = ω₀ + ω₁x₁² + ω₂x₂² + ω₃x₃²
f(x₁, x₂, x₃) = ω₁e^(x₁) + ω₂e^(x₂) + ω₃cos(x₃)


As you can tell, we can come up with all kinds of nonlinear functions, and the possibilities related to what we can do and how much of the world we can model using nonlinear interactions are limitless. In fact, neural networks are successful because of their ability to pick up on the relevant nonlinear interactions between the data features.
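
For concreteness, the three nonlinear formulas above translate directly into Python; the weight values below are again arbitrary placeholders, not values from the text:

```python
import math

# Arbitrary illustrative weights, not values from the text.
w0, w1, w2, w3 = 1.0, 2.0, 3.0, 4.0

def f_product(x1, x2, x3):
    # nonlinear because the features x2 and x3 multiply each other
    return w0 + w1 * x1 + w2 * x2 * x3

def f_squares(x1, x2, x3):
    # nonlinear because the features appear with a power other than one
    return w0 + w1 * x1**2 + w2 * x2**2 + w3 * x3**2

def f_transcendental(x1, x2, x3):
    # nonlinear because the features are embedded inside exp and cos
    return w1 * math.exp(x1) + w2 * math.exp(x2) + w3 * math.cos(x3)
```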


We will use the previous notation and terminology throughout the book, so you will become very familiar with terms like linear combination, weights, features, and linear and nonlinear interactions between features.

An Example of Real Data


You can find the Python code to investigate the data and produce the figures in the following two examples at the book’s GitHub page.

Structured Data


The two data sets for height, weight, and gender that we will work with here are examples of structured data sets. They come organized in rows and columns. Columns contain the features, such as weight, height, gender, health index, etc. Rows contain the feature scores for each data instance, in this case, each person. On the other hand, data sets that are a bunch of audio files, Facebook posts, images, or videos are all examples of unstructured data sets.


I downloaded two data sets from the Kaggle website for data scientists. Both data sets contain height, weight, and gender information for a certain number of individuals. My goal is to learn how the weight of a person depends on their height. Mathematically, I want to write a formula for the weight as a function of one feature, the height:

weight = f(height)


so that if I am given the height of a new person, I would be able to predict their weight. Of course, there are other features than height that a person’s weight depends on, such as their gender, eating habits, workout habits, genetic predisposition, etc. However, for the data sets that I downloaded, we only have height, weight, and gender data available. Unless we want to look for more detailed data sets, or go out and collect new data, we have to work with what we have. Moreover, the goal of this example is only to illustrate the difference between real data and simulated data. We can work with more involved data sets with larger numbers of features when we have more involved goals.


For the first data set, I plot the weight column against the height column in Figure 2-1, and obtain something that seems to have no pattern at all!

Figure 2-1. Plotting the weight against the height for the first data set, we cannot detect a pattern. The plots on the top and the right side of the scatterplot show the respective histograms and empirical distributions of the height and weight data.


For the second data set, I do the same, and I can visually observe an obvious linear dependency in Figure 2-2. The data points seem to congregate around a straight line!


So what is going on? Why does my first real data set reflect no dependency between the height and weight of a person whatsoever, but my second one reflects a linear dependency? We need to look deeper into the data.


This is one of the many challenges of working with real data. We do not know what function generated the data, and why it looks the way it looks. We investigate, gain insights, detect patterns, if any, and we propose a hypothesis function. Then we test our hypothesis, and if it performs well based on our measures of performance, which have to be thoughtfully crafted, we deploy it into the real world. We make predictions using our deployed model, until new data tells us that our hypothesis is no longer valid, in which case we investigate the updated data and formulate a new hypothesis. This process and feedback loop keeps going for as long as our models are in business.

Figure 2-2. Plotting the weight against the height for the second data set, we observe a linear pattern. Note that the empirical distribution of the weight data is plotted on the right side of the figure, and that of the height data on the top. Both appear to have two peaks (bimodal), suggesting mixed distributions. In fact, both the height and weight data sets can be modeled using a mixture of two normal distributions (called a Gaussian mixture), representing the underlying distributions of the mixed female and male data. Thus, if we plot the data for the female or male subpopulation alone, as in Figure 2-6, we observe normally distributed (bell-shaped) height and weight data.


Before moving on to simulated data, let’s explain why the first data set seemed to have no insight at all about the relationship between the height and the weight of an individual. Upon further inspection, we notice that the data set has an overrepresentation of individuals with Index scores 4 and 5, referring to obesity and extreme obesity. So, I decided to split the data by Index score, and plot the weight against the height for all individuals with similar Index scores. This time around, a linear dependency between the height and the weight is evident in Figure 2-3, and the mystery is resolved. This might feel like we are cheating our way to linearity, by conditioning on individuals’ Index scores. All is fair game in the name of data exploration.
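
The conditioning step is just a filter on the Index column before plotting. Here is a sketch with a few invented rows standing in for the real data set (the column names Height, Weight, and Index mirror the Kaggle data, but the numbers are made up):

```python
# Hypothetical rows standing in for the first Kaggle data set.
rows = [
    {"Height": 66, "Weight": 112, "Index": 3},
    {"Height": 70, "Weight": 140, "Index": 3},
    {"Height": 63, "Weight": 220, "Index": 5},
    {"Height": 74, "Weight": 135, "Index": 3},
    {"Height": 68, "Weight": 250, "Index": 5},
]

# Keep only individuals with Index score 3 before plotting weight against height.
index3 = [(r["Height"], r["Weight"]) for r in rows if r["Index"] == 3]
print(index3)  # [(66, 112), (70, 140), (74, 135)]
```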

Figure 2-3. Plotting the weight against the height for individuals in the first data set with similar Index scores, we observe a linear pattern. The figure shows the weight against the height for individuals with an Index score of 3.


Now we can safely go ahead and hypothesize that the weight depends linearly on the height:

weight = ω₀ + ω₁ × height


Of course, we are left with the task of finding appropriate values for the parameters ω₀ and ω₁. Chapter 3 teaches us how to do exactly that. In fact, the bulk of activity in machine learning, including deep learning, is about learning these ω’s from the data. In our very simple example, we only have two ω’s to learn, since we only had one feature, the height, and we assumed linear dependency after observing a linear pattern in the real data. In the next few chapters, we will encounter some deep learning networks with millions of ω’s to learn, yet we will see that the mathematical structure of the problem is in fact the same exact structure that we will learn in Chapter 3.
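
As a small preview of what learning the two ω’s amounts to for a single feature: the ordinary least squares formulas give the best-fit slope and intercept in closed form. The tiny data set below is made up and exactly linear, so the recovered parameters are easy to check by hand:

```python
# Made-up, exactly linear data: weight = -310 + 7 * height.
heights = [60, 65, 70, 75]
weights = [110, 145, 180, 215]

# Ordinary least squares in closed form for one feature:
# slope = covariance(height, weight) / variance(height), intercept from the means.
n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n
slope = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) / \
        sum((h - mean_h) ** 2 for h in heights)
intercept = mean_w - slope * mean_h
print(intercept, slope)  # -310.0 7.0
```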

An Example of Simulated Data


In this example, I simulate my own height-weight data set. Simulating our own data circumvents the trouble of searching for data from the web, the real world, or even building a lab to obtain controlled measurements. This is incredibly valuable when the required data is not available or very expensive to obtain. It also helps test different scenarios by only changing numbers in a function, as opposed to, say, creating new materials or building labs and running new experiments. Simulating data is so convenient because all we need is a mathematical function, a probability distribution if we want to involve randomness and/or noise, and a computer.


Let’s again assume linear dependency between the height and the weight, so the function that we will use is:

weight = ω₀ + ω₁ × height


For us to be able to simulate numerical (height, weight) pairs, or data points, we must assume numerical values for the parameters ω₀ and ω₁. Without having insights from real data about the correct choices for these ω’s, we are left with making educated guesses from the context of the problem and experimenting with different values. Note that for the height-weight case in this example, we happen to have real data that we can use to learn appropriate values for the ω’s, and one of the goals of Chapter 3 is to learn how to do that. However, in many other scenarios, we do not have real data, so the only way to go is experimenting with various numerical values for these ω’s.


In the following simulations, we set ω₀ = −314.5 and ω₁ = 7.07, so the function becomes:

weight = −314.5 + 7.07 × height


Now we can generate as many numerical (height, weight) pairs as we want. For example, plugging height = 60 into the formula for the weight function, we get weight = −314.5 + 7.07 × 60 = 109.7. So our linear model predicts that a person whose height is 60 inches weighs 109.7 pounds, and the data point that we can plot on the height-weight graph has coordinates (60, 109.7). In Figure 2-4, we generate 5,000 of these data points: we choose 5,000 values for the height between 54 and 79 inches and plug them into the weight function. We notice that the graph in Figure 2-4 is a perfect straight line, with no noise or variation in the simulated data, since we did not incorporate those into our linear model.
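
Generating the data behind Figure 2-4 takes only a few lines: evenly spaced heights plugged into the model. The plotting itself is omitted here; see the book’s GitHub page for the full code.

```python
# Generate 5,000 noiseless (height, weight) points from
# weight = -314.5 + 7.07 * height, for heights between 54 and 79 inches.
n = 5000
heights = [54 + (79 - 54) * i / (n - 1) for i in range(n)]
weights = [-314.5 + 7.07 * h for h in heights]

# Sanity check against the worked example in the text: height 60 gives weight 109.7.
print(round(-314.5 + 7.07 * 60, 1))  # 109.7
```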


That’s a hallmark of simulated data: it does what the function(s) it was generated from does. If we understand the function (called model) that we used to build our simulation, and if our computation does not accumulate too many numerical errors and/or very large numbers that go rogue, then we understand the data that our model generates, and we can use this data in any way we see fit. There isn’t much space for surprises. In our example, our proposed function is linear, so its equation is that of a straight line, and as you see in Figure 2-4, the generated data lies perfectly on this straight line.

Figure 2-4. Simulated data: we generate 5,000 (height, weight) points using the linear function weight = −314.5 + 7.07 × height.


What if we want to simulate more realistic data for height and weight? Then we can sample the height values from a more realistic distribution for the heights of a human population: the bell-shaped normal distribution! Again, we know the probability distribution that we are sampling from, which is different from the case for real data. After we sample the height values, we plug those into the linear model for the weight, then we add some noise, since we want our simulated data to be realistic. Since noise has a random nature, we must also pick the probability distribution it will be sampled from. We again choose the bell-shaped normal distribution, but we could have chosen the uniform distribution to model uniform random fluctuations. Our more realistic height-weight model becomes:
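
A sketch of this more realistic simulation, using only the standard library; the mean and standard deviation of the height distribution (66 and 3 inches) and the noise scale (10 pounds) are assumed values for illustration, not taken from the text:

```python
import random

random.seed(0)  # make the run reproducible

# Sample heights from a bell-shaped normal distribution, plug them into the
# linear model, then add normally distributed noise to each weight.
heights = [random.gauss(66, 3) for _ in range(5000)]
weights = [-314.5 + 7.07 * h + random.gauss(0, 10) for h in heights]

print(len(weights))  # 5000
```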

weight = −314.5 + 7.07 × height + noise


We obtain the results seen in Figure 2-5.

Figure 2-5. Simulated data: we generate 5,000 (height, weight) points using the linear function weight = −314.5 + 7.07 × height. The height points are normally distributed, and we also add normally distributed noise. Note that the distributions of the weight and height data, on the right side and the top of the figure respectively, are normally distributed. This is not surprising, since we designed them that way in our simulation.


Now compare Figure 2-5 containing our simulated height-weight data to Figure 2-6 containing real height-weight data of 5,000 females from the second data set that we used. Not too bad, given that it only took five minutes of code writing to generate this data, as opposed to collecting real data! Had we spent more time tweaking the values of our ω’s, and the parameters for the normally distributed noise that we added (mean and standard deviation), we would’ve obtained an even better-looking simulated data set. However, we will leave this simulation here since very soon our whole focus will be on learning the appropriate parameter values for our hypothesized models.

Figure 2-6. Real data: the weight data plotted against the height data for the 5,000 females in the second data set. Note that the distributions of the females’ weight and height data, on the right side and the top of the figure respectively, are normally distributed. See the book’s GitHub page for more details.

Mathematical Models: Simulations and AI


We can always adjust our mathematical models to make them more realistic. We are the designers, so we get to decide what goes into these models. It is often the case that the more a model mimics nature, the more mathematical objects get incorporated within it. Therefore, while building a mathematical model, the usual trade-off is between getting closer to reality, and the model’s simplicity and accessibility for mathematical analysis and computation. Different designers come up with different mathematical models, and some capture certain phenomena better than others. These models keep improving and evolving as the quest to capture natural behaviors continues. Thankfully, our computational capabilities have dramatically improved in the past decades, enabling us to create and test more involved and realistic mathematical models.


Nature is at the same time very finely detailed and enormously vast. Interactions in nature range from the subatomic quantum realm all the way to the intergalactic scale. We, as humans, are forever trying to understand nature and capture its intricate components with their numerous interconnections and interplays. Our reasons for this are varied. They include pure curiosity about the origins of life and the universe, creating new technologies, enhancing communication systems, designing drugs and discovering cures for diseases, building weapons and defense systems, and traveling to distant planets and perhaps inhabiting them in the future. Mathematical models provide an excellent and almost miraculous way to describe nature with all its details using only numbers, functions, equations, and invoking quantified randomness through probability when faced with uncertainty. Computer simulations of these mathematical models enable us to investigate and visualize various simple and complex behaviors of the modeled systems or phenomena. In turn, insights from computer simulations aid in model enhancement and design, in addition to supplying deeper mathematical insights. This incredibly positive feedback cycle makes mathematical modeling and simulations an indispensable tool that is enhanced greatly with our increased computational power.


It is a mystery of the universe that its various phenomena can be accurately modeled using the abstract language of mathematics, and it is a marvel of the human mind that it can discover and comprehend mathematics, and build powerful technological devices that are useful for all kinds of applications. Equally impressive is that these devices, at their core, are doing nothing but computing or transmitting mathematics, more specifically, a bunch of zeros and ones.


The fact that humans are able to generalize their understanding of simple numbers all the way to building and applying mathematical models for natural phenomena at all kinds of scales is a spectacular example of generalization of learned knowledge, and is a hallmark of human intelligence. In the AI field, a common goal for both general AI (human-like and super-AI) and narrow AI (specific task oriented) is generalization: the ability of an AI agent to generalize learned abilities to new and unknown situations. In Chapter 3, we will understand this principle for narrow and task-oriented AI: an AI agent learns from data, then produces good predictions for new and unseen data.


AI interacts in three ways with mathematical models and simulations:

Mathematical models and simulations create data for AI systems to train on.


Self-driving cars are considered by some to be a benchmark for AI. It will be inconvenient to let intelligent car prototypes drive off cliffs, hit pedestrians, or crash into new work zones before the car’s AI system learns that these are unfavorable events that must be avoided. Training on simulated data is especially valuable here, as simulations can create all kinds of hazardous virtual situations for a car to train on before releasing it out on the roads. Similarly, simulated data is tremendously helpful for training AI systems for rovers on Mars, drug discovery, materials design, weather forecasting, aviation, military training, and so on.

AI enhances existing mathematical models and simulations.


AI has great potential to assist in areas that have traditionally been difficult and limiting for mathematical models and simulations, such as learning appropriate values for the parameters involved in the models, appropriate probability distributions, mesh shapes and sizes when discretizing equations (fine meshes capture fine details and delicate behaviors at various spatial and time scales), and scaling computational methods to longer times or to larger domains with complicated shapes. Fields like navigation, aviation, finance, materials science, fluid dynamics, operations research, molecular and nuclear sciences, atmospheric and ocean sciences, astrophysics, physical and cyber security, and many others rely heavily on mathematical modeling and simulations. Integrating AI capabilities into these domains is starting to take place with very positive outcomes. We will come across examples of AI enhancing simulations in later chapters of this book.

AI itself is a mathematical model and simulation.


One of the big aspirations of AI is to computationally replicate human intelligence. Successful machine learning systems, including neural networks with all their architectures and variations, are mathematical models aimed at simulating tasks that humans associate with intelligence, such as vision, pattern recognition and generalization, communication through natural language, and logical reasoning. Understanding, emotional experience, empathy, and collaboration are also associated with intelligence and have contributed tremendously to the success and domination of humankind, so we must also find ways to replicate them if we want to achieve general AI while gaining a deeper understanding of the nature of intelligence and the workings of the human brain. Efforts in these areas are already on the way. What we want to keep in mind is that in all these areas, what machines are doing is computing. Machines compute meanings of documents for natural language processing, combine and compute digital image pixels for computer vision, convert audio signals to vectors of numbers and compute new audio for human-machine interaction, and so on. It is then easy to see how software AI is one big mathematical model and simulation. This will become more evident as we progress in this book.

Where Do We Get Our Data From?


When I first decided to enter the AI field, I wanted to apply my mathematical knowledge to help solve real-world problems that I felt passionate about. I grew up in a war-torn country, and saw many problems erupt, disrupt, then eventually dissipate or get resolved, either by direct fixes or by the human network adjusting around them and settling into completely new (unstable) equilibria. Common problems in war were sudden and massive disruptions to different supply chains, sudden destruction of large parts of the power grid, sudden paralysis of entire road networks by targeted bombing of certain bridges, sudden emergence of terrorist networks, black markets, trafficking, inflation, and poverty. The number of problems that math can help solve in these scenarios, including war tactics and strategy, is limitless. From the safety of the United States, my Ph.D. in mathematics, and my tenure at a university, I started approaching companies, government agencies, and the military, looking for real projects with real data to work on. I offered to help find solutions to their problems for free. What I did not know, and learned the hard way, was that getting real data was the biggest hurdle. There are many regulations, privacy issues, institutional review boards, and other obstacles standing in the way. Even after jumping through all these hoops, the companies, institutions, and organizations tend to hold on to their data, even when they know they are not making the best use of it, and one almost has to beg in order to get real data. It turned out that the experience I had was not unique. The same had happened to many others in the field.


This story is not meant to discourage you from getting the real data that you need to train your AI systems. The point is to not get surprised and disheartened if you encounter hesitation and resistance from the owners of the data that you need. Keep asking, and someone will be willing to take that one leap of faith.


Sometimes the data you need is available publicly on the web. For the simple models in this chapter, I am using data sets from the Kaggle website. There are other great public data repositories, which I will not list here, but a simple Google search with keywords like “best data repositories” will return excellent results. Some repositories are geared toward computer vision, others toward natural language processing, audio generation and transcription, scientific research, and so on.


Crawling the web to acquire data is common, but you have to abide by the rules of the websites you are crawling. You also have to learn how to crawl (some people say that the difference between data scientists and statisticians is that data scientists know how to hack!). Some websites require you to obtain written permission before you crawl. For example, if you are interested in social media user behavior, or in collaboration networks, you can crawl social media and professional networks—Facebook, Instagram, YouTube, Flickr, LinkedIn, etc.—for statistics on user accounts, such as the number of friends or connections, likes, comments, and activity on these sites. You will end up with very large data sets with hundreds of thousands of records, which you can then do your computations on.


To gain an intuitive understanding of how data gets integrated into AI, and the type of data that goes into various systems, while at the same time avoiding feeling overwhelmed by all the information and the data that is out there, it is beneficial to develop a habit of exploring the data sets that successful AI systems were trained on, if they are available. You do not have to download them and work on them. Browsing the data set, its metadata, what features and labels (if any) it comes with, etc., is enough to get you comfortable with data. For example, DeepMind’s WaveNet (2016), which we will learn about in Chapter 7, is a neural network that generates raw machine audio, with realistic-sounding human voices or enjoyable pieces of music. It accomplishes tasks like text-to-audio conversion with natural-sounding human voice connotations, even with a specific person’s voice if the network gets conditioned on this person’s voice. We will understand the mathematical meaning of conditioning when we study WaveNet in Chapter 7. For now, think of it as a restriction imposed artificially on a problem so it restricts its results to a certain set of outcomes. So what data was WaveNet trained on? For multispeaker audio generation that is not conditioned on text, WaveNet was trained on a data set of audio files consisting of 44 hours of audio from 109 different speakers: the English Multispeaker Corpus from CSTR Voice Cloning Toolkit (2012). For converting text to speech, WaveNet was trained on the North American English data set, which contains 24 hours of speech data, and on the Mandarin Chinese data set, which has 34.8 hours of speech data. 
For generating music, WaveNet was trained on the YouTube Piano data set, which has 60 hours of solo piano music obtained from YouTube videos, and the MagnaTagATune data set (2009), which consists of about 200 hours of music audio, where each 29-second clip is labeled with 188 tags describing the genre, instrumentation, tempo, volume, mood, and various other labels for the music. Labeled data is extremely valuable for AI systems, because it provides a ground truth to measure the output of your hypothesis function against. We will learn this in the next few sections.

How about the famous image classification (for computer vision) AlexNet (2012)? What data was its convolutional neural network trained and tested on? AlexNet was trained on ImageNet, a data set containing millions of images (scraped from the internet) and labeled (crowdsourced human labelers) with thousands of classes.

Note that all of these were examples of unstructured data.

If the data that a certain system was trained on is not publicly available, it is good to look up the published paper on the system or its documentation and read about how the required data was obtained. That alone will teach you a lot.

Before moving on to doing mathematics, keep in mind the following takeaways:

  • AI systems need digital data.

  • Sometimes, the data you need is not easy to acquire.

  • There is a movement to digitalize our whole world.

The Vocabulary of Data Distributions, Probability, and Statistics

When you enter a new field, the first thing you want to learn is the vocabulary of that field. It is similar to learning a new language. You can learn it in a classroom, and suffer, or you can travel to a country that speaks the language, and listen to frequently used terms. You don’t have to know what “bonjour” means in French. But while you are in France, you notice that people say it to each other all the time, so you start saying it as well. Sometimes you will not use it in the right context, like when you have to say “bonsoir” instead of “bonjour.” But slowly, as you find yourself staying longer in France, you will be using the right vocabulary in the right context.

One more advantage of learning the vocabulary as fast as you can, without necessarily mastering any of the details, is that different fields refer to the same concepts with different terms, since there is a massive vocabulary collision out there. This ends up being a big source of confusion, therefore the embodiment of language barriers. When you learn the common vocabulary of the field, you will realize that you might already know the concepts, except that now you have new names for them.

The vocabulary terms from probability and statistics that you want to know for the purposes of AI applications are not too many. I will define each term once we get to use it, but note that the goal of probability theory is to make deterministic statements about random or stochastic quantities and events, since humans hate uncertainty and like their world to be controllable and predictable. Watch for the following language from the fields of probability and statistics whenever you are reading about AI, machine learning, or data science. Again, you do not have to know any of the definitions yet; you just need to hear the terms discussed in the following sections and be familiar with the way they progress after each other.

Random Variables

It all starts with random variables. Math people talk about functions nonstop. Functions have certain or deterministic outcomes. When you evaluate a function, you know exactly what value it will return. Evaluate the function $x^2$ at 3, and you are certain you will get $3^2 = 9$. Random variables, on the other hand, do not have deterministic outcomes. Their outcomes are uncertain, unpredictable, or stochastic. When you call a random variable, you do not know, before you actually see the outcome, what value it will return. Since you cannot aim for certainty anymore, what you can instead aim for is quantifying how likely it is to get an outcome. For example, when you roll a die, you can confidently say that your chance to get the outcome 4 is 1/6, assuming that the die you rolled is not tampered with. You never know ahead of time what outcome you will get before you actually roll the die. If you did, then casinos would run out of business, and the finance sector would eliminate its entire predictive analytics and risk management departments. Just like deterministic functions, a random variable can return outcomes from a discrete set (discrete random variable) or from the continuum (continuous random variable). The key distinction between a random variable and a function is in the randomness versus the certainty of the outcomes.
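
To make the contrast concrete, here is a quick sketch in Python with NumPy (the seed is fixed only for reproducibility): the function returns the same value every time, while the die roll does not, and simulation recovers the 1/6 chance of rolling a 4.

```python
import numpy as np

def f(x):
    # A deterministic function: the same input always returns the same output
    return x ** 2

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# A discrete random variable: we cannot know the outcome before rolling,
# but we can quantify how likely each outcome is
rolls = rng.integers(1, 7, size=100_000)  # 100,000 simulated die rolls
p_four = np.mean(rolls == 4)              # estimated P(outcome = 4)

print(f(3))    # always 9
print(p_four)  # close to 1/6 ≈ 0.1667
```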

Probability Distributions

After random variables we define probability density functions for continuous random variables and probability mass functions for discrete random variables. We call both distributions in order to add to our confusion. Usually, whether a distribution represents a discrete or a continuous random variable is understood from the context. Using this terminology, we sometimes say that one random variable, whether continuous or discrete, is sampled from a probability distribution, and multiple random variables are sampled from a joint probability distribution. In practice, it is rare that we know the full joint probability distribution of all the random variables involved in our data. When we do, or if we are able to learn it from the data, it is a powerful thing.

Marginal Probabilities

Marginal probability distributions sit literally on the margins of a joint probability distribution (if we represent the joint probability distribution with a table containing the probabilities of all the combined states of the involved variables; see, for example, the first table on this Wikipedia page). In this setting, you are lucky enough to have access to the full joint probability distribution of multiple random variables, and you are interested in finding out the probability distribution of only one or few of them. You can find these marginal probability distributions easily using the sum rule for probabilities, for example:

$p(x) = \sum_{y \in \text{all states of } y} p(x, y)$
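
If we represent the joint distribution as such a table, the sum rule is just a row or column sum. A small sketch with made-up probabilities:

```python
import numpy as np

# A made-up joint distribution p(x, y): rows are the 2 states of x,
# columns are the 3 states of y, and all entries sum to one
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.10, 0.20]])

# Sum rule: marginalize out y (sum over its states) to get p(x), and vice versa
p_x = p_xy.sum(axis=1)  # marginal of x: [0.50, 0.50]
p_y = p_xy.sum(axis=0)  # marginal of y: [0.30, 0.35, 0.35]
```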

The Uniform and the Normal Distributions

The uniform distribution and the normal distribution are the most popular continuous distributions, so we start with them. The normal distribution and the fundamental central limit theorem from probability theory are intimately related. There are many other useful distributions representing the different random variables involved in our data, but we do not need them right away, so we postpone them until we need to use them.
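
A quick sketch of sampling from both distributions; the mean and standard deviation for the normal are illustrative, loosely mimicking male heights in feet.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 100,000 samples from Uniform(0, 1) and from Normal(5.75, 0.25)
u = rng.uniform(0.0, 1.0, size=100_000)
n = rng.normal(loc=5.75, scale=0.25, size=100_000)

# Sample statistics land close to the theoretical values
print(u.mean())           # ≈ 0.5
print(n.mean(), n.std())  # ≈ 5.75 and ≈ 0.25
```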

Conditional Probabilities and Bayes’ Theorem

The moment we start dealing with multiple random variables (such as our gender, height, weight, and health index data), which is almost always the case, we introduce conditional probabilities, Bayes’ Rule or Theorem, and the product or chain rule for conditional probabilities, along with the concepts of independent and conditionally independent random variables (knowing the value of one does not change the probability of the other).
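
A minimal numerical sketch of Bayes' Rule, using the classic disease-testing setup with made-up numbers for the prior and the test's error rates:

```python
# Bayes' Rule with hypothetical numbers: P(disease | positive test)
p_disease = 0.01             # prior probability of having the disease
p_pos_given_disease = 0.95   # likelihood: test sensitivity
p_pos_given_healthy = 0.05   # false positive rate

# Total probability of a positive test (the evidence)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Rule: posterior = likelihood * prior / evidence
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ≈ 0.161: still unlikely despite the positive test
```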

Conditional Probabilities and Joint Distributions

Both conditional probabilities and joint distributions involve multiple random variables, so it makes sense that they have something to do with each other. Slice the graph of a joint probability distribution (when we fix the value of one of the variables) and we get a conditional probability distribution (see Figure 2-7 later on).

Bayes’ Rule Versus Joint Probability Distribution

It is very important to keep the following in mind: if we happen to have access to the full joint probability distribution of all the multiple random variables that we care for in our setting, then we would not need Bayes’ Rule. In other words, Bayes’ Rule helps us calculate the desired conditional probabilities when we do not have access to the full joint probability distribution of the involved random variables.

Prior Distribution, Posterior Distribution, and Likelihood Function

From logical and mathematical standpoints, we can define conditional probabilities, then move on smoothly with our calculations and our lives. Practitioners, however, give different names to different conditional probabilities, depending on whether they are conditioning on data that has been observed or on weights (also called parameters) that they still need to estimate. The vocabulary words here are: prior distribution (general probability distribution for the weights of our model prior to observing any data), posterior distribution (probability distribution for the weights given the observed data), and the likelihood function (function encoding the probability of observing a data point given a particular weight distribution). All of these can be related through Bayes' Rule, as well as through the joint distribution.

We Say Likelihood Function not Likelihood Distribution

We refer to the likelihood as a function and not as a distribution because probability distributions must add up to one (or integrate to one if we are dealing with continuous random variables), but the likelihood function does not necessarily add up to one (or integrate to one in the case of continuous random variables).
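
A quick numerical check of this point, using the binomial likelihood of observing 7 heads in 10 coin flips as a function of the unknown heads probability p (the counts are illustrative):

```python
import numpy as np
from math import comb

# Likelihood of observing 7 heads in 10 flips, as a function of the unknown
# heads probability p (the binomial likelihood)
def likelihood(p, heads=7, flips=10):
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# A probability density over p must integrate to one; the likelihood need not.
# Numerically integrating over p in [0, 1] gives 1/11, not 1:
ps = np.linspace(0, 1, 100_001)
area = np.sum(likelihood(ps)) * (ps[1] - ps[0])
print(area)  # ≈ 0.0909
```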

Mixtures of Distributions

We can mix probability distributions and produce mixtures of distributions. Gaussian mixtures are pretty famous. The earlier height data that contains measurements for both males and females is a good example of a Gaussian mixture.
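
A sketch of sampling from such a mixture; the component means, standard deviations, and the 50/50 mixing weight are illustrative, not taken from the earlier data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A two-component Gaussian mixture loosely mimicking adult heights in inches
n = 100_000
is_male = rng.random(n) < 0.5                 # mixing: pick a component per sample
heights = np.where(is_male,
                   rng.normal(69.0, 3.0, n),  # "male" component
                   rng.normal(64.0, 2.7, n))  # "female" component

# The mixture mean is the weighted average of the component means: ≈ 66.5
print(heights.mean())
```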

Sums and Products of Random Variables

We can add or multiply random variables sampled from simple distributions to produce new random variables with more complex distributions, representing more complex random events. The natural question that’s usually investigated here is: what is the distribution of the sum random variable or the product random variable?
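
For example, the sum of two fair dice is not uniform, even though each die is. A quick simulation:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Each die is uniform over {1, ..., 6}, but their sum is not uniform over {2, ..., 12}
die1 = rng.integers(1, 7, size=100_000)
die2 = rng.integers(1, 7, size=100_000)
sums = die1 + die2

values, counts = np.unique(sums, return_counts=True)
most_likely = values[np.argmax(counts)]
print(most_likely)  # 7: six of the 36 equally likely pairs add up to it
```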

Using Graphs to Represent Joint Probability Distributions

Finally, we use directed and undirected graph representations (diagrams) to efficiently decompose joint probability distributions. This makes our computational life much cheaper and tractable.

Expectation, Mean, Variance, and Uncertainty

Four quantities are central to probability, statistics, and data science: expectation and mean, quantifying an average value, and the variance and standard deviation, quantifying the spread around that average value, hence encoding uncertainty. Our goal is to have control over the variance in order to reduce the uncertainty. The larger the variance, the more error you can commit when using your average value to make predictions. Therefore, when you explore the field, you often notice that mathematical statements, inequalities, and theorems mostly involve some control over the expectation and variance of any quantities that involve randomness.

When we have one random variable with a corresponding probability distribution, we calculate the expectation (expected average outcome), variance (expected squared distance from the expected average), and standard deviation (expected distance from the average). For data that has been already sampled or observed, for example, our height and weight data above, we calculate the sample mean (average value), variance (average squared distance from the mean), and standard deviation (average distance from the mean, so this measures the spread around the mean). So if the data we care for has not been sampled or observed yet, we speculate on it using the language of expectations, but if we already have an observed or measured sample, we calculate its statistics. Naturally we are interested in how far off our speculations are from our computed statistics for the observed data, and what happens in the limiting (but idealistic) case where we can in fact measure data for the entire population. The law of large numbers answers that for us and tells us that in this limiting case (when the sample size goes to infinity), our expectation matches the sample mean.
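
A quick illustration of the law of large numbers with die rolls (the expectation of a fair die roll is $(1 + 2 + \dots + 6)/6 = 3.5$):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Sample means computed from observed rolls approach the expectation 3.5
# as the sample size grows
means = [rng.integers(1, 7, size=s).mean() for s in (10, 1_000, 100_000)]
print(means)  # the last sample mean is very close to 3.5
```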

Covariance and Correlation

When we have two or more random variables, we calculate the covariance, correlation, and covariance matrix. This is when the field of linear algebra with its language of vectors, matrices, and matrix decompositions (such as eigenvalues and singular value decompositions) gets married to the field of probability and statistics. The variance of each random variable sits on the diagonal of the covariance matrix, and the covariances of each possible pair sit off the diagonal. The covariance matrix is symmetric. When you diagonalize it, using standard linear algebra techniques, you uncorrelate the involved random variables.
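
A sketch with synthetic correlated height-weight data; the linear relationship and all the numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Synthetic correlated data: weight made to depend linearly on height plus noise
height = rng.normal(67, 3, size=10_000)
weight = 4 * height + rng.normal(0, 10, size=10_000)
data = np.stack([height, weight])

cov = np.cov(data)  # variances on the diagonal, covariances off the diagonal
print(cov[0, 1])    # positive: height and weight co-vary

# Diagonalize the symmetric covariance matrix to uncorrelate the variables
eigenvalues, eigenvectors = np.linalg.eigh(cov)
centered = data - data.mean(axis=1, keepdims=True)
uncorrelated = eigenvectors.T @ centered
print(np.cov(uncorrelated))  # off-diagonal entries are now ≈ 0
```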

Meanwhile, we pause and make sure we know the difference between independence and zero covariance. Covariance and correlation are all about capturing a linear relationship between two random variables. Correlation works on normalized random variables, so that we can still detect linear relationships even if random variables or data measurements have vastly different scales. When you normalize a quantity, its scale doesn’t matter anymore. It wouldn’t matter whether it is measured on a scale of millions or on a 0.001 scale. Covariance works on unnormalized random variables. Life is not all linear. Independence is stronger than zero covariance.

Markov Process

Markov processes are very important for AI’s reinforcement learning paradigm. They are characterized by all possible states of a system, a set of all possible actions that can be performed by an agent (move left, move right, etc.), a matrix containing the transition probabilities between all states, the probability distribution for what states an agent will transition to after taking a certain action, and a reward function, which we want to maximize. Two popular examples from AI include board games and a smart thermostat such as Nest. We will go over these in Chapter 11.
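
A minimal sketch of the transition-matrix idea, with a hypothetical two-state weather chain (the probabilities are made up):

```python
import numpy as np

# States: 0 = sunny, 1 = rainy. Row i holds the transition probabilities
# out of state i, so each row sums to one.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# The distribution over states after n steps is the start distribution times P^n
start = np.array([1.0, 0.0])  # start sunny
after_20 = start @ np.linalg.matrix_power(P, 20)
print(after_20)  # ≈ the stationary distribution [5/6, 1/6]
```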

Normalizing, Scaling, and/or Standardizing a Random Variable or Data Set

This is one of the many cases where there is a vocabulary collision. Normalizing, scaling, and standardizing are used synonymously in various contexts. The goal is always the same. Subtract a number (shift) from the data or from all possible outcomes of a random variable, then divide by a constant number (scale). If you subtract the mean of your data sample (or the expectation of your random variable) and divide by their standard deviation, then you get new standardized or normalized data values (or a new standardized or normalized random variable) that have a mean equal to zero (or expectation zero) and standard deviation equal to one. If instead you subtract the minimum and divide by the range (max value minus min value), then you get new data values or a new random variable with outcomes all between zero and one. Sometimes we talk about normalizing vectors of numbers. In this case, what we mean is we divide every number in our vector by the length of the vector itself, so that we obtain a new vector of length one. So whether we say we are normalizing, scaling, or standardizing a collection of numbers, the goal is to try to control the values of these numbers, center them around zero, and/or restrict their spread to be less than or equal to one while at the same time preserving their inherent variability.
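
The three operations side by side, on a small illustrative data set:

```python
import numpy as np

x = np.array([62.0, 65.0, 67.0, 70.0, 74.0])  # illustrative heights in inches

# Standardize: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Min-max scale: subtract the minimum, divide by the range -> values in [0, 1]
scaled = (x - x.min()) / (x.max() - x.min())

# Normalize a vector: divide by its length -> a vector of length one
v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)

print(standardized.mean(), standardized.std())  # 0 and 1
print(scaled.min(), scaled.max())               # 0.0 and 1.0
print(np.linalg.norm(unit))                     # 1.0
```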

Common Examples

Mathematicians like to express probability concepts in terms of flipping coins, rolling dice, drawing balls from urns, drawing cards from decks, trains arriving at stations, customers calling a hotline, customers clicking on an ad or a website link, diseases and their symptoms, criminal trials and evidence, and the time until something happens, such as a machine failing. Do not be surprised that these examples are everywhere, as they generalize nicely to many other real-life situations.

In addition to this map of probability theory, we will borrow very few terms and functions from statistical mechanics (for example, the partition function) and information theory (for example, signal versus noise, entropy, and the cross-entropy function). We will explain these when we encounter them in later chapters.

Continuous Distributions Versus Discrete Distributions (Density Versus Mass)

When we deal with continuous distributions, it is important to use terms like observing or sampling a data point near or around a certain value instead of observing or sampling an exact value. In fact, the probability of observing an exact value in this case is zero.

When our numbers are in the continuum, there is no discrete separation between one value and the next value. Real numbers have an infinite precision. For example, if I measure the height of a male and I get 6 feet, I wouldn’t know whether my measurement is exactly 6 feet or 6.00000000785 feet or 5.9999111134255 feet. It’s better to set my observation in an interval around 6 feet, for example 5.95 < height < 6.05, then quantify the probability of observing a height between 5.95 and 6.05 feet.
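
A numerical sketch of this, assuming (purely for illustration) that height follows a normal distribution with mean 5.75 and standard deviation 0.25 feet:

```python
import numpy as np

mu, sigma = 5.75, 0.25  # illustrative parameters for height in feet

def density(x):
    # The normal probability density function
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# The probability of landing in (5.95, 6.05) is the area under the density there
xs = np.linspace(5.95, 6.05, 10_001)
fx = density(xs)
p = np.sum((fx[:-1] + fx[1:]) / 2) * (xs[1] - xs[0])  # trapezoid rule
print(p)  # ≈ 0.097; the probability of any single exact height is zero
```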

We do not have such a worry for discrete random variables, as we can easily separate the possible values from each other. For example, when we roll a die, our possible values are 1, 2, 3, 4, 5, or 6. So we can confidently assert that the probability of rolling an exact 5 is 1/6. Moreover, a discrete random variable can have nonnumerical outcomes; for example, when we flip a coin, our possible values are heads or tails. A continuous random variable can only have numerical outcomes.

Because of that reasoning, when we have a continuous random variable, we define its probability density function, not its probability mass function, as in the case of discrete random variables. A density specifies how much of a substance is present within a certain length or area or volume of space (depending on the dimension we’re in). To find the mass of a substance in a specified region, we multiply the density by the length, area, or volume of the considered region. If we are given the density per an infinitesimally small region, then we must integrate over the whole region to find the mass within that region, because an integral is akin to a sum over infinitely many infinitesimally small regions.

We will elaborate on these ideas and mathematically formalize them in Chapter 11. For now, we stress the following:

  • If we only have one continuous random variable, such as the height of males in a certain population, then we use a one-dimensional probability density function to represent its probability distribution: $f(x_1)$. To find the probability of the height being between 5.95 < height < 6.05, we integrate the probability density function $f(x_1)$ over the interval (5.95, 6.05), and we write:

    $P(5.95 < \text{height} < 6.05) = \int_{5.95}^{6.05} f(x_1)\, dx_1$
  • If we have two continuous random variables, such as the height and weight of males in a certain population, or the true height and the measured height of a person (which usually includes random noise), then we use a two-dimensional probability density function to represent their joint probability distribution: $f(x_1, x_2)$. So in order to find the joint probability that the height is between 5.95 < height < 6.05 and the weight is between 160 < weight < 175, we double integrate the joint probability density function $f(x_1, x_2)$, assuming that we know the formula for $f(x_1, x_2)$, over the intervals (5.95, 6.05) and (160, 175), and we write:

    $P(5.95 < \text{height} < 6.05,\ 160 < \text{weight} < 175) = \int_{160}^{175} \int_{5.95}^{6.05} f(x_1, x_2)\, dx_1\, dx_2$
  • If we have more than two continuous random variables, then we use a higher-dimensional probability density function to represent their joint distribution. For example, if we have the height, weight, and blood pressure of males in a certain population, then we use a three-dimensional joint probability distribution function: $f(x_1, x_2, x_3)$. Similar to the reasoning explained previously, to find the joint probability that the first random variable is between $a < x_1 < b$, the second random variable is between $c < x_2 < d$, and the third random variable is between $e < x_3 < f$, we triple integrate the joint probability density function over the intervals (a,b), (c,d), and (e,f), and we write:

    $P(a < x_1 < b,\ c < x_2 < d,\ e < x_3 < f) = \int_{e}^{f} \int_{c}^{d} \int_{a}^{b} f(x_1, x_2, x_3)\, dx_1\, dx_2\, dx_3$
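A double integral of this kind can be sketched numerically. Here we assume, purely for illustration, that height and weight are independent normals (with made-up parameters) so that the joint density factors into a product; in reality height and weight are dependent, and the true joint density would not factor this way.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # The normal probability density function
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Illustrative assumption: height ~ Normal(5.75, 0.25) feet and
# weight ~ Normal(165, 20) pounds, independent, so f(x1, x2) = f(x1) * f(x2)
h = np.linspace(5.95, 6.05, 201)  # height interval (feet)
w = np.linspace(160, 175, 201)    # weight interval (pounds)
H, W = np.meshgrid(h, w)
joint = normal_pdf(H, 5.75, 0.25) * normal_pdf(W, 165, 20)

# 2D trapezoid rule over the rectangle (5.95, 6.05) x (160, 175)
cells = (joint[:-1, :-1] + joint[1:, :-1] + joint[:-1, 1:] + joint[1:, 1:]) / 4
p = np.sum(cells) * (h[1] - h[0]) * (w[1] - w[0])
print(p)  # ≈ 0.028
```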

Not all our worries (that things will add up mathematically) are eliminated even after defining the probability density function for a continuous random variable. Again, the culprit is the infinite precision of real numbers. If we allow all sets to have probability, we encounter paradoxes in the sense that we can construct disjoint sets (such as fractal-shaped sets or sets formulated by transforming the set of rational numbers) whose probabilities add up to more than one! One must admit that these sets are pathological and must be carefully constructed by a person who has plenty of time on their hands; however, they exist and they produce paradoxes. Measure theory in mathematics steps in and provides a mathematical framework where we can work with probability density functions without encountering paradoxes. It defines sets of measure zero (these occupy no volume in the space that we are working in), then gives us plenty of theorems that allow us to do our computations almost everywhere, that is, except on sets of measure zero. This turns out to be more than enough for our applications.

The Power of the Joint Probability Density Function

Having access to the joint probability distribution of many random variables is a powerful but rare thing. The reason is that the joint probability distribution encodes within it the probability distribution of each separate random variable (marginal distributions), as well as all the possible co-occurrences (and the conditional probabilities) that we ever encounter between these random variables. This is akin to seeing a whole town from above, rather than being inside the town and observing only one intersection between two or more alleys.

If the random variables are independent, then the joint distribution is simply the product of each of their individual distributions. However, when the random variables are not independent, such as the height and the weight of a person, or the observed height of a person (which includes measurement noise) and the true height of a person (which doesn’t include noise), accessing the joint distribution is much more difficult and expensive storage-wise. The joint distribution in the case of dependent random variables is not separable, so we cannot only store each of its parts alone. We need to store every value for every co-occurrence between the two or more variables. This exponential increase in storage requirements (and computations or search spaces) as you increase the number of dependent random variables is one embodiment of the infamous curse of dimensionality.
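
A tiny computation makes the exponential growth concrete: with k states per variable and n dependent variables, the joint table needs k to the power n entries, versus n separate k-entry tables when the variables are independent.

```python
# With k states per variable, a joint table over n dependent variables needs
# k**n entries; n independent variables need only n tables of k entries each
k = 10
for n in (2, 5, 10):
    print(n, k ** n, n * k)  # joint-table entries vs. total if independent
```

For k = 10 and n = 10, that is ten billion joint-table entries versus one hundred entries in the independent case.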

When we slice a joint probability density function, say $f(x_1, x_2)$, meaning when we fix one of the random variables (or more in higher dimensions) to be an exact value, we retrieve a distribution proportional to the posterior probability distribution (probability distribution of the model parameters given the observations), which we are usually interested in. For example, slice through $f(x_1, x_2)$ at $x_1 = a$, and we get $f(a, x_2)$, which happens to be proportional to the probability distribution $f(x_2 | x_1 = a)$ (see Figure 2-7).

Figure 2-7. Slicing through the joint probability distribution

Again, this is in the luxurious case where we know the joint probability distribution; otherwise, we use Bayes’ Rule to obtain the same posterior probability distribution (using the prior distribution and the likelihood function).

In some AI applications, the AI system learns the joint probability distribution by separating it into a product of conditional probabilities using the product rule for probabilities. Once it learns the joint distribution, it then samples from it to generate new and interesting data. DeepMind’s WaveNet does that in its process of generating raw audio.
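The product rule factorization that such models rely on can be checked numerically on a toy example; the joint table below is made up for illustration:

```python
from itertools import product

# A tiny joint distribution over three binary random variables,
# stored explicitly as a table (made-up weights, normalized to sum to 1).
raw = [1, 3, 2, 2, 4, 1, 3, 4]
joint = {o: w / sum(raw) for o, w in zip(product([0, 1], repeat=3), raw)}

def marginal(fixed):
    """Probability that the variables at the given indices take the
    given values, e.g. fixed = {0: 1, 1: 0}."""
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in fixed.items()))

# Product rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2)
x = (1, 0, 1)
p_chain = (marginal({0: x[0]})
           * marginal({0: x[0], 1: x[1]}) / marginal({0: x[0]})
           * joint[x] / marginal({0: x[0], 1: x[1]}))
print(abs(p_chain - joint[x]) < 1e-12)  # the two sides agree
```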

The next sections introduce the most useful probability distributions for AI applications. Two ubiquitous continuous distributions are the uniform distribution and the normal distribution (also known as the Gaussian distribution), so we start there. Refer to the Jupyter Notebook for reproducing figures and more details.

Distribution of Data: The Uniform Distribution

To intuitively understand the uniform distribution, let’s give an example of a nonuniform distribution, which we have already seen earlier in this chapter. In our real height-weight data sets, we cannot use the uniform distribution to model the height data. We also cannot use it to model the weight data. The reason is that human heights and weights are not evenly distributed. In the general population, it is not equally likely to encounter a person with height around 7 feet as it is to encounter a person with height around 5 feet 6 inches.

The uniform distribution only models data that is evenly distributed. If we have an interval ( x min , x max ) containing all the values in the continuum between x min and x max of our data, and our data is uniformly distributed over our interval, then the probability of observing a data point near any particular value in our interval is the same for all values in this interval. That is, if our interval is ( 0 , 1 ) , it is equally likely to pick a point near 0.2 as it is to pick a point near 0.75.

The probability density function for the uniform distribution is therefore constant. For one random variable x over an interval ( x min , x max ) , the formula for the probability density function for the continuous uniform distribution is given by:

$$f(x; x_{\min}, x_{\max}) = \frac{1}{x_{\max} - x_{\min}} \quad \text{for } x_{\min} < x < x_{\max}$$

and zero otherwise.

Let’s plot the probability density function for the uniform distribution over an interval ( x min , x max ) . The graph in Figure 2-8 is a straight segment because uniformly distributed data, whether real or simulated, appears evenly distributed across the entire interval under consideration. No data values within the interval are more favored to appear than others.

Figure 2-8. The graph of the probability density function of the uniform distribution over the interval [0,1]

The uniform distribution is extremely useful in computer simulations for generating random numbers from any other probability distribution. If you peek into the random number generators that Python uses, you would see the uniform distribution used somewhere in the underlying algorithms.
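One way this works under the hood is inverse transform sampling: feed uniform samples through the inverted cumulative distribution function of the target distribution. A minimal sketch (the function name is ours), using the exponential distribution as the target:

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """If U ~ Uniform(0, 1), then -ln(1 - U) / lam follows the
    exponential distribution with rate lam (inverse transform sampling)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
mean = sum(samples) / len(samples)
print(round(mean, 2))  # should land close to 1 / lam = 0.5
```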

Distribution of Data: The Bell-Shaped Normal (Gaussian) Distribution

A continuous probability distribution better suited to model human height data (when restricted to one gender) is the bell-shaped normal distribution, also called the Gaussian distribution. Samples from the normal distribution tend to congregate around an average value where the distribution peaks, called the mean μ , then dwindle symmetrically as we get farther away from the mean. How far from the mean the distribution spreads out as it dwindles down is controlled by the second parameter of the normal distribution, called the standard deviation σ . About 68% of the data falls within one standard deviation of the mean, 95% of the data falls within two standard deviations of the mean, and about 99.7% of the data falls within three standard deviations of the mean (Figure 2-9).

Figure 2-9. The graph of the probability density function of the bell-shaped normal distribution with parameters μ = 0 and σ = 1

Values near the mean are more likely to be picked (or to occur, or to be observed) when we are sampling data from the normal distribution, and values that are very small (approaching − ∞ ) or very large (approaching ∞ ) are less likely to be picked. This peaking near the mean value and decaying on the outer skirts of the distribution give this distribution its famous bell shape. Note that there are other bell-shaped continuous distributions out there, but the normal distribution is the most prevalent. It has a neat mathematical justification for this well-deserved fame, based on an important theorem in probability theory called the central limit theorem (CLT).

The central limit theorem states that the average of many independent random variables that all have the same distribution (not necessarily the normal distribution) is normally distributed. This explains why the normal distribution appears everywhere in society and nature. It models baby birth weights, student grade distributions, countries’ income distributions, distribution of blood pressure measurements, etc. There are special statistical tests that help us determine whether a real data set can be modeled using the normal distribution. We will expand on these ideas later in Chapter 11.
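We can watch the CLT at work with a quick simulation (the sample sizes below are arbitrary choices): averages of uniform samples, which are not normal individually, obey the normal distribution's 68% rule.

```python
import random
import statistics

# Average n independent Uniform(0, 1) draws, many times over; by the
# CLT these averages are approximately normally distributed.
rng = random.Random(42)
n, trials = 48, 20_000
averages = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]

mu = statistics.mean(averages)
sigma = statistics.stdev(averages)
within_one = sum(abs(a - mu) <= sigma for a in averages) / trials
print(round(within_one, 2))  # close to 0.68, as the normal distribution predicts
```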

If you happen to find yourself in a situation where you are uncertain and have no prior knowledge about which distribution to use for your application, the normal distribution is usually a reasonable choice. In fact, among all choices of distributions with the same variance, the normal distribution is the one with maximum uncertainty (maximum entropy), so it does in fact encode the least amount of prior knowledge into your model.

The formula for the probability density function of the normal distribution for one random variable x (univariate) with mean μ and standard deviation σ is:

$$g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and its graph for μ = 0 and σ = 1 is plotted in Figure 2-9.

The formula for the probability density function for the normal distribution of two random variables x and y (bivariate) is:

$$g(x, y; \mu_1, \sigma_1, \mu_2, \sigma_2, \rho) = \frac{1}{2\pi\sqrt{\det\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}}}\, e^{-\frac{1}{2}\begin{pmatrix} x-\mu_1 & y-\mu_2 \end{pmatrix}\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}^{-1}\begin{pmatrix} x-\mu_1 \\ y-\mu_2 \end{pmatrix}}$$

and its graph is plotted in Figure 2-10.

We can write the above bivariate formula in more compact notation using the language of linear algebra:

$$g(x, y; \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\sqrt{\det\Sigma}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}, \quad \text{where } \mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix} \text{ and } \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$$

In Figure 2-11, we sample 6,000 points from the bivariate normal distribution. Points near the center are more likely to be picked, and points away from the center are less likely to be picked. The lines roughly trace the contour lines of the normal distribution, had we only observed the sample points without knowing which distribution they were sampled from.

Figure 2-10. The graph of the probability density function of the bell-shaped bivariate normal distribution
Figure 2-11. Sampling 6,000 points from the bivariate normal distribution

Let’s pause and compare the formula of the probability density function for the bivariate normal distribution to the formula of the probability density function for the univariate normal distribution:

  • When there is only one random variable, we only have one mean μ and one standard deviation σ .

  • When there are two random variables, we have two means μ 1 and μ 2 and two standard deviations σ 1 and σ 2 . The variance σ 2 will be replaced by the covariance matrix $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$ and its determinant. ρ is the correlation between the two random variables, which is the covariance of the two normalized versions of the random variables.

The same exact formula for the probability density function of the bivariate normal distribution generalizes to any dimension, where we have many random variables instead of only two random variables. For example, if we have 100 random variables, representing 100 features in a data set, the mean vector in the formula will have 100 entries in it, and the covariance matrix will have the size 100 × 100 , with the variance of each random variable on the diagonal and the covariance of each of the 4,950 pairs of random variables off the diagonal.
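As a sketch of how correlated normal samples like those in Figure 2-11 can be generated, here is a library-free Python version using Box-Muller draws plus a hand-rolled 2×2 Cholesky mix (all parameter values below are made up):

```python
import math
import random

def sample_bivariate_normal(mu1, mu2, sigma1, sigma2, rho, n, seed=0):
    """Sample n points from a bivariate normal with the given means,
    standard deviations, and correlation rho."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # Box-Muller: two independent standard normal draws
        u1, u2 = 1.0 - rng.random(), rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        z1, z2 = r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)
        # Mix z1 and z2 so that x and y have correlation rho
        x = mu1 + sigma1 * z1
        y = mu2 + sigma2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        points.append((x, y))
    return points

pts = sample_bivariate_normal(0, 0, 1, 1, rho=0.7, n=50_000)
xs, ys = [p[0] for p in pts], [p[1] for p in pts]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in pts) / len(pts)
sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
corr = cov / (sx * sy)
print(round(corr, 2))  # the sample correlation recovers rho, close to 0.7
```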

Distribution of Data: Other Important and Commonly Used Distributions

Almost everything you did not understand in this chapter will be revisited many times throughout the book, and Chapter 11 focuses exclusively on probability. The concepts will get reinforced as they appear again and again in various interesting contexts. Our goal for this chapter is to get exposed to the vocabulary of probability and statistics, and to have a guiding map for the important ideas that frequently appear in AI applications. We also want to acquire a good probabilistic intuition for the following chapters without having to take a deep dive and delay our progress unnecessarily.

There are many probability distributions out there. Each models a different type of real-world scenario. The uniform and normal distributions are very common, but we have other important distributions that frequently appear in the AI field. Recall that our goal is to model the world around us in order to make good designs, predictions, and/or decisions. Probability distributions help us make predictions when our models involve randomness or when we are uncertain about our outcomes.

When we study distributions, one frustrating part is that most of them have weird names that provide zero intuition about what kind of phenomena a given distribution would be useful for. This makes us either expend extra mental energy to memorize these names, or keep a distribution cheat sheet in our pocket. I prefer to keep a cheat sheet. Another frustrating part is that most textbook examples involve flipping a coin, rolling a die, or drawing colored balls from urns. This leaves us with no real-life examples or motivation to understand the subject, as I never met anyone walking around flipping coins and counting heads or tails, except for Two-Face (aka Harvey Dent) in The Dark Knight (a really good 2008 movie, where the Joker [played by Heath Ledger] says some of my favorite and profound statements about randomness and chance, like this one: “The world is cruel. And the only morality in a cruel world…​is chance. Unbiased. Unprejudiced. Fair.”). I will try to amend this as much as I can in this book, pointing to as many real-world examples as my page limit allows.

Some of the following distributions are mathematically related to each other, or follow naturally from others. We will explore these relationships in Chapter 10. For now, let’s name a popular distribution, state whether it is discrete (predicts a count of something that we care for) or continuous (predicts a quantity that exists in the continuum, such as the time needed to elapse before something happens; careful, this is not the number of hours, since the number of hours is discrete, but it is the length of the time period), state the parameters that control it, and state its defining properties that are useful for our AI applications:

Binomial distribution

This is discrete. It represents the probability of obtaining a certain number of successes when repeating one experiment, independently, multiple times. Its controlling parameters are n, the number of experiments we perform, and p, the predefined probability of success. Real-world examples include predicting the number of patients that will develop side effects for a vaccine or a new medication in a clinical trial, the number of ad clicks that will result in a purchase, and the number of customers that will default on their monthly credit card payments. When we model examples from the real world using a probability distribution that requires independent trials, it means that we are assuming independence even if the real-world trials are not really independent. It is good etiquette to point out our models’ assumptions.
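The binomial probability mass function is simple enough to compute directly; the clinical-trial numbers below are made up for illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical trial: 100 patients, each with a 5% chance of a side
# effect. Probability that exactly 5 patients develop one:
print(round(binomial_pmf(5, 100, 0.05), 3))  # about 0.18
```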

Poisson distribution

This is discrete. It predicts the number of rare events that will occur in a given period of time. These events are independent or weakly dependent, meaning that the occurrence of the event once does not affect the probability of its next occurrence in the same time period. They also occur at a known and constant average rate λ . Thus, we know the average rate, and we want to predict how many of these events will happen during a certain time period. The Poisson distribution’s controlling parameter is the predefined rare event rate λ . Real-world examples include predicting the number of babies born in a given hour, the number of people in a population who age past 98, the number of alpha particles discharged from a radioactive system during a certain time period, the number of duplicate bills sent out by the IRS, the number of a not-too-popular product sold on a particular day, the number of typos that one page of this book contains, the number of defective items produced by a certain machine on a certain day, the number of people entering a store at a certain hour, the number of car crashes an insurance company needs to cover within a certain time period, and the number of earthquakes happening within a particular time period.

Geometric distribution

This is discrete. It predicts the number of trials needed before we obtain a success when performing independent trials, each with a known probability p for success. The controlling parameter here is obviously the probability p for success. Real-world examples include estimating the number of weeks that a company can function without experiencing a network failure, the number of hours a machine can function before producing a defective item, or the number of people we need to interview before meeting someone who opposes a certain political bill that we want to pass. Again, for these real-world examples, we might be assuming independence if modeling using the geometric distribution, while in reality the trials might not be independent.
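A tiny simulation (the probability value is made up) confirms the textbook fact that the expected number of trials until the first success is 1/p:

```python
import random

def trials_until_success(p, rng):
    """Count independent Bernoulli(p) trials until the first success,
    i.e., draw one sample from the geometric distribution."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(7)
p = 0.2
sims = [trials_until_success(p, rng) for _ in range(100_000)]
print(round(sum(sims) / len(sims), 1))  # close to 1 / p = 5.0
```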

Exponential distribution

This is continuous. If we happen to know that a certain event occurs at a constant rate λ , then the exponential distribution predicts the waiting time until this event occurs. It is memoryless, in the sense that the remaining lifetime of an item whose lifetime follows this exponential distribution is also exponential. The controlling parameter is the constant rate λ . Real-world examples include the amount of time we have to wait until an earthquake occurs, the time until someone defaults on a loan, the time until a machine part fails, or the time before a terrorist attack strikes. This is very useful in the reliability field, where the reliability of a certain machine part is calculated, hence statements such as a 10-year guarantee, etc.
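The memoryless property is easy to verify from the survival function P(T > t) = e^(−λt); the numbers below are arbitrary:

```python
from math import exp

lam, s, t = 0.5, 2.0, 3.0  # arbitrary rate and waiting times

def survival(u):
    """P(T > u) for an exponential random variable with rate lam."""
    return exp(-lam * u)

# Memorylessness: P(T > s + t | T > s) = P(T > t)
conditional = survival(s + t) / survival(s)
print(abs(conditional - survival(t)) < 1e-12)  # the two probabilities match
```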

Weibull distribution

This is continuous. It is widely used in engineering in the field of predicting product lifetimes (10-year warranty statements are appropriate here as well). Here, a product consists of many parts, and if any of its parts fail, then the product stops working. For example, a car will not work if the battery fails, or if a fuse in the gearbox burns out. A Weibull distribution provides a good approximation for the lifetime of a car before it stops working, after accounting for its many parts and their weakest link (assuming we are not maintaining the car and resetting the clock). It is controlled by three parameters: shape, scale, and location. The exponential distribution is a special case of this distribution, because the exponential distribution has a constant rate of event occurrence, but the Weibull distribution can model rates of occurrence that increase or decrease with time.

Log-normal distribution

This is continuous. If we take the logarithm of each value sampled from this distribution, we get normally distributed data. This means that in the beginning your data might not appear normally distributed, but if you try transforming it using the log function, you will see normally distributed data. This is a good distribution to use when encountering skewed data with a low mean value, a large variance, and only positive values. Just like the normal distribution appears when you average many independent samples of a random variable (using the central limit theorem), the log-normal distribution appears when you take the product of many positive sample values. Mathematically, this is due to an awesome property of log functions: the log of a product is the sum of the logs. This distribution is controlled by three parameters: shape, scale, and location. Real-world examples include the volume of gas in a petroleum reserve, and the ratio of the price of a security at the end of one day to its price at the end of the day before.
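A quick simulation (all constants arbitrary) shows the mechanism: multiply many positive random factors, take logs, and a roughly normal histogram emerges because the log of a product is a sum of logs.

```python
import math
import random
import statistics

rng = random.Random(1)
products = []
for _ in range(20_000):
    prod = 1.0
    for _ in range(30):
        prod *= rng.uniform(0.5, 1.5)  # a positive random factor
    products.append(prod)

# The logs of the products are sums of independent terms, so by the
# CLT they should look approximately normal: check the 68% rule.
logs = [math.log(p) for p in products]
mu, sigma = statistics.mean(logs), statistics.stdev(logs)
share = sum(abs(v - mu) <= sigma for v in logs) / len(logs)
print(round(share, 2))  # close to 0.68
```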

Chi-squared distribution

This is continuous. It is the distribution of a sum of squares of normally distributed independent random variables. You might wonder why we would care about squaring normally distributed random variables, then adding them up. The answer is that this is how we usually compute the variance of a random variable or of a data sample, and one of our main goals is controlling the variance in order to lower our uncertainties. There are two types of significance tests associated with this distribution: the goodness of fit test, which measures how far off our expectations are from our observations, and tests for the independence and homogeneity of data features.

Pareto distribution

This is continuous. It is useful for many real-world applications, such as the time to complete a job assigned to a supercomputer (think machine learning computations), the household income level in a certain population, the number of friends in a social network, and the file size of internet traffic. This distribution is controlled by only one parameter α , and it is heavy tailed (its tail is heavier than that of the exponential distribution).

Let’s throw in a few other distributions before moving on, without fussing over any of the details. These are all more or less related to the aforementioned distributions:

Student’s t-distribution

Continuous, similar to the normal distribution, but used when the sample size is small and the population variance is unknown.

Beta distribution

Continuous, produces random values in a given interval.

Cauchy distribution

Continuous, pathological because neither its mean nor its variance is defined, can be obtained using the tangents of randomly chosen angles.

Gamma distribution

Continuous, has to do with the waiting time until n independent events occur, instead of only one event, as in the exponential distribution.

Negative binomial distribution

Discrete, has to do with the number of independent trials needed to obtain a certain number of successes.

Hypergeometric distribution

Discrete, similar to the binomial but the trials are not independent.

Negative hypergeometric distribution

Discrete, captures the number of dependent trials needed before we obtain a certain number of successes.

The Various Uses of the Word “Distribution”

You might have already noticed that the word distribution refers to many different (but related) concepts, depending on the context. This inconsistent use of the same word could be a source of confusion and an immediate turnoff for some people who are trying to enter the field.

Let’s list the different concepts that the word distribution refers to, so that we easily recognize its intended meaning in a given context:

  • If you have real data, such as the height-weight data in this chapter, and plot the histogram of one feature of your data set, such as the height, then you get the empirical distribution of the height data. You usually do not know the underlying probability density function of the height of the entire population, also called distribution, since the real data you have is only a sample of that population. So you try to estimate it, or model it, using the probability distributions given by probability theory. For the height and weight features, when separated by gender, a Gaussian distribution is appropriate.

  • If you have a discrete random variable, the word distribution could refer to either its probability mass function or its cumulative distribution function (which specifies the probability that the random variable is less than or equal to a certain value, F ( x ) = prob ( X ≤ x ) ).

  • If you have a continuous random variable, the word distribution could refer to either its probability density function or its cumulative distribution function, whose integral gives the probability that the random variable is less than or equal to a certain value.

  • If you have multiple random variables (discrete, continuous, or a mix of both), then the word distribution refers to their joint probability distribution.
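The relationship between a probability mass function and the cumulative distribution function it induces can be sketched in a few lines (the values and probabilities below are made up):

```python
# A discrete random variable given via its probability mass function.
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

def cdf(x):
    """Cumulative distribution function: prob(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

print(round(cdf(1), 1))  # prob(X <= 1) = 0.1 + 0.3 = 0.4
print(round(cdf(3), 1))  # all the mass: 1.0
```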

A common goal is to establish an appropriate correspondence between an idealized mathematical function, such as a random variable with an appropriate distribution, and real observed data or phenomena, with an observed empirical distribution. When working with real data, each feature of the data set can be modeled using a random variable. So in a way, a mathematical random variable with its corresponding distribution is an idealized version of our measured or observed feature.

Finally, distributions appear everywhere in AI applications. We will encounter them plenty of times in the next chapters, for example, the distribution of the weights at each layer of a neural network, and the distribution of the noise and errors committed by various machine learning models.

A/B Testing

Before leaving this chapter, we make a tiny detour into the world of A/B testing, also called split testing, or randomized single-blind or double-blind trials. We make this detour because this is a topic important for data scientists: countless companies rely on data from A/B tests to increase engagement, revenue, and customer satisfaction. Microsoft, Amazon, LinkedIn, Google, and others each conduct thousands of A/B tests annually.

The idea of an A/B test is simple: split the population into two groups. Roll out a version of something you want to test (a new web page design, a different font size, a new medicine, a new political ad) to one group, the test group, and keep the other group as a control group. Compare the data between the two groups.
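Deciding whether the two groups really differ is a statistics question. One common choice for conversion-type metrics is a two-proportion z-test; the sketch below uses made-up counts and our own function name:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between
    group A (control) and group B (test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Made-up experiment: 200 of 5,000 control users converted, versus
# 260 of 5,000 users who saw the new page design.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(round(z, 2), round(p, 3))  # a small p-value hints the new design helped
```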

The test is single blind if the subjects do not know which group they belong to (some do not even know that they are in a test at all), but the experimenters know. The test is double blind if neither the experimenters nor the subjects know which group they are interacting with.

Summary and Looking Ahead

In this chapter, we emphasized the fact that data is central to AI. We also clarified the differences between concepts that are usually a source of confusion: structured and unstructured data, linear and nonlinear models, real and simulated data, deterministic functions and random variables, discrete and continuous distributions, and posterior probabilities and likelihood functions. We also provided a map for the probability and statistics needed for AI without diving into any of the details, and we introduced the most popular probability distributions.

If you find yourself lost in some new probability concept, you might want to consult the map provided in this chapter and see how that concept fits within the big picture of probability theory, and most importantly, how it relates to AI. Without knowing how a particular mathematical concept relates to AI, you are left with having some tool that you know how to turn on, but you have no idea what it is used for.

We have not yet mentioned random matrices and high-dimensional probability. In these fields, probability theory, with its constant tracking of distributions, expectations, and variances of any relevant random quantities, merges with linear algebra, with its hyperfocus on eigenvalues and various matrix decompositions. These fields are very important for the extremely high-dimensional data that is involved in AI applications. We discuss them in Chapter 11 on probability.

In the next chapter, we learn how to fit our data into a function, then use this function to make predictions and/or decisions. Mathematically, we find the weights (the ω ’s) that characterize the strengths of various interactions between the features of our data. When we characterize the involved types of interactions (the formula of the fitting function, called the learning or training function) and the strengths of these interactions (the values of the ω ’s), we can make our predictions. In AI, this one concept of characterizing the fitting function with its suitable weight values can be used successfully for computer vision, natural language processing, predictive analytics (like house prices, time until maintenance, etc.), and many other applications.

Chapter 3. Fitting Functions to Data

Today it fits. Tomorrow?

H.

In this chapter, we introduce the core mathematical ideas lying at the heart of many AI applications, including the mathematical engines of neural networks. Our goal is to internalize the following structure of the machine learning part of an AI problem:

Identify the problem

The problem depends on the specific use case: classify images, classify documents, predict house prices, detect fraud or anomalies, recommend the next product, predict the likelihood of a criminal reoffending, predict the internal structure of a building given external images, convert speech to text, generate audio, generate images, generate video, etc.

Acquire the appropriate data

This is about training our models to do the right thing. We say that our models learn from the data. Make sure this data is clean, complete, and if necessary, depending on the specific model we are implementing, transformed (normalized, standardized, some features aggregated, etc.). This step is usually way more time-consuming than implementing and training the machine learning models.

Create a hypothesis function

We use the terms hypothesis function, learning function, prediction function, training function, and model interchangeably. Our main assumption is that this input/output mathematical function explains the observed data, and it can be used later to make predictions on new data. We give our model features, like a person’s daily habits, and it returns a prediction, like this person’s likelihood to pay back a loan. In this chapter, we will give our model the length measurements of a fish, and it will return its weight.

Find the numerical values of weights

We will encounter many models (including neural networks) where our training function has unknown parameters called weights. The goal is to find the numerical values of these weights using the data. After we find these weight values, we can use the trained function to make predictions by plugging the features of a new data point into the formula of the trained function.

Create an error function

To find the values of the unknown weights, we create another function called the error function, cost function, objective function, or loss function (everything in the AI field has three or more names). This function has to measure some sort of distance between the ground truth and our predictions. Naturally, we want our predictions to be as close to the ground truth as possible, so we search for weight values that minimize our loss function. Mathematically, we solve a minimization problem. The field of mathematical optimization is essential to AI.

Decide on mathematical formulas

Throughout this process, we are the engineers, so we get to decide on the mathematical formulas for training functions, loss functions, optimization methods, and computer implementations. Different engineers decide on different processes, with different performance results, and that is OK. The judge, in the end, is the performance of the deployed model, and contrary to popular belief, mathematical models are flexible and can be tweaked and altered when needed. It is crucial to monitor performance after deployment.

Find a way to search for minimizers

Since our goal is to find the weight values that minimize the error between our predictions and ground truths, we need to find an efficient mathematical way to search for these minimizers: those special weight values that produce the least error. The gradient descent method plays a key role here. This powerful yet simple method involves calculating one derivative of our error function. This is one reason we spent half of our calculus classes calculating derivatives (and the gradient: this is one derivative in higher dimensions). There are other methods that require computing two derivatives. We will encounter them and comment on the benefits and the downsides of using higher-order methods.

Use the backpropagation algorithm

When data sets are enormous and our model happens to be a layered neural network, we need an efficient way to calculate this one derivative. The backpropagation algorithm steps in at this point. We will walk through gradient descent and backpropagation in Chapter 4.

Regularize a function

If our learning function fits the given data too well, then it will not perform well on new data. A function with too good of a fit with the data picks up on the noise in the data as well as the signal (for example, the function on the left in Figure 3-1). We do not want to pick up on noise. This is where regularization helps. There are multiple mathematical ways to regularize a function, which means to make it smoother and less oscillatory and erratic. In general, a function that follows the noise in the data oscillates too much. We want more regular functions. We visit regularization techniques in Chapter 4.

Figure 3-1. Left: a fitting function that fits the data perfectly; however, it is not a good prediction function, since it fits the noise in the data rather than the main signal. Right: a more regular function fitting the same data set. This function will give better predictions than the one in the left subplot, even though the function on the left matches the data points more closely.

In the following sections, we explore this structure of an AI problem with real, but simple, data sets. We will see in subsequent chapters how the same concepts generalize to much more involved tasks.

Traditional and Very Useful Machine Learning Models

All the data used in this chapter is labeled with ground truths, and the goal of our models is to predict the labels of new (unseen) and unlabeled data. This is supervised learning.

In the next few sections, we fit training functions into our labeled data using the following popular machine learning models. While you may hear so much about the latest and greatest developments in AI, in a typical business setting you are probably better off starting with these more traditional models:

Linear regression

Predict a numerical value.

Logistic regression

Classify into two classes (binary classification).

Softmax regression

Classify into multiple classes.

Support vector machines

Classify into two classes, or regression (predict a numerical value).

Decision trees

Classify into any number of classes, or regression (predict a numerical value).

Random forests

Classify into any number of classes, or regression (predict a numerical value).

Ensembles of models

Bundle up the results of many models by averaging the prediction values, voting for the most popular class, or some other bundling mechanism.

k-means clustering

Cluster data points into any number of clusters.

We try multiple models on the same data sets to compare performance. In the real world, it is rare that any model ever gets deployed without having been compared with many other models. This is the nature of the computation-heavy AI industry, and why we need parallel computing, which enables us to train multiple models at once (except for models that build and improve on the results of other models, like in the case of stacking; for those we cannot use parallel computing).

Before we dive into any machine learning models, it is extremely important to note that it has been reported again and again that only about 5% of a data scientist’s time, and/or an AI researcher’s time, is spent on training machine learning models. The majority of the time is consumed by acquiring data, cleaning data, organizing data, creating appropriate pipelines for data, etc., before feeding the data into machine learning models. So machine learning is only one step in the production process, and it is an easy step once the data is ready to train the model. We will discover how these machine learning models work: most of the mathematics we need resides in these models. AI researchers are always trying to enhance machine learning models and automatically fit them into production pipelines. It is therefore important for us to eventually learn about the whole pipeline, from raw data (including its storage, hardware, query protocols, etc.) to deployment to monitoring. Learning machine learning is only one piece of a bigger and more interesting story.

We must start with regression since the ideas of regression are so fundamental for most of the AI models and applications that will follow. Only for linear regression do we find our minimizing weights using an analytical method, giving an explicit formula for the desired weights directly in terms of the training data set and its target labels. The simplicity of the linear regression model allows for this explicit analytical solution. Most other models do not have such explicit solutions, and we have to find their minimizers using numerical methods, among which gradient descent is extremely popular.
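The analytical least-squares solution mentioned here can be sketched directly with NumPy. This is a minimal illustration under stated assumptions: the data, seed, and weight values are made up, and `np.linalg.lstsq` is used to solve the least-squares problem that the explicit formula describes, which is numerically safer than forming an explicit matrix inverse.

```python
import numpy as np

# Minimal sketch (made-up data): the analytical least-squares solution
# for linear regression, w = argmin ||Xw - y||^2.
rng = np.random.default_rng(0)
X_features = rng.random((10, 5))               # 10 fish, 5 length features
true_w = np.array([1.0, 2.0, 0.5, -1.0, 3.0])  # hypothetical true weights
y = 4.0 + X_features @ true_w                  # noiseless labels, bias = 4.0

# Prepend a column of ones so the bias ω_0 is learned like any other weight
X = np.hstack([np.ones((10, 1)), X_features])
w, *_ = np.linalg.lstsq(X, y, rcond=None)      # stable normal-equations solve
print(np.round(w, 6))
```

On this noiseless toy data, the recovered weights match the true bias and weights exactly (up to floating-point error), which is precisely what the explicit analytical formula promises.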

In regression and many other upcoming models, including the neural networks of the next few chapters, watch for the following progression in the modeling process:

  1. The training function

  2. The loss function

  3. Optimization

Numerical Solutions Versus Analytical Solutions

It is important to be aware of the differences between numerical solutions and analytical solutions of mathematical problems. A mathematical problem can be anything, such as:

  • Find the minimizer of some function.

  • Find the best way to go from destination A to destination B, with a constrained budget.

  • Find the best way to design and query a data warehouse.

  • Find the solution of a mathematical equation (where a lefthand side with math stuff equals a righthand side with math stuff). These equations could be algebraic equations, ordinary differential equations, partial differential equations, integro-differential equations, systems of equations, or any sort of mathematical equations. Their solutions could be static or evolving in time. They could model anything from the physical, biological, socioeconomic, or natural worlds.

Here is the vocabulary:

Numerical

Has to do with numbers

Analytical

Has to do with analysis

As a rule of thumb, numerical solutions are much easier to obtain and much more accessible than analytical solutions, provided that we have enough computational power to simulate and compute these solutions. All we usually need to do is discretize some continuous spaces and/or functions (change the continuum into a bunch of points), albeit sometimes in very clever ways, and evaluate functions on these discrete quantities. The only problem with numerical solutions is that they are only approximate solutions. Unless they are backed by estimates on how far off they are from the true analytical solutions and how fast they converge to these true solutions, which in turn requires mathematical background and analysis, numerical solutions are not exact. They do, however, provide incredibly useful insights about the true solutions. In many cases, numerical solutions are the only ones available, and many scientific and engineering fields would not have advanced at all had they not relied on numerical solutions of complex problems. If those fields waited for analytical solutions and proofs to happen, or, in other words, for mathematical theory to catch up, they would have had very slow progress.

Analytical solutions, on the other hand, are exact, robust, and have a whole mathematical theory backing them up. They come accompanied with theorems and proofs. When analytical solutions are available, they are very powerful. They are, however, not easily accessible, sometimes impossible to obtain, and they do require deep knowledge and domain expertise in fields such as calculus, mathematical analysis, algebra, theory of differential equations, etc. Analytical methods, however, are extremely valuable for describing important properties of solutions (even when explicit solutions are not available), guiding numerical techniques, and providing ground truths to compare approximate numerical methods against (in the lucky cases when these analytical solutions are available).

Some researchers are purely analytical and theoretical, others are purely numerical and computational, and the best place to exist is somewhere near the intersection, where we have a decent understanding of the analytical and the numerical aspects of our mathematical problems.

Regression: Predict a Numerical Value

A quick search on the Kaggle website for data sets for regression returns many excellent data sets and related notebooks. I randomly chose a simple Fish Market data set that we will use to explain our upcoming mathematics. Our goal is to build a model that predicts the weight of a fish given its five different length measurements, or features, labeled in the data set as Length1, Length2, Length3, Height, and Width (see Figure 3-2). For the sake of simplicity, we choose not to incorporate the categorical feature, Species, into this model, even though we could (and that would give us better predictions, since a fish’s type is a good predictor of its weight). If we choose to include the Species feature, then we would have to convert its values into numerical values, using one-hot encoding, which means exactly what it sounds like: assign a code for each fish made up of ones and zeros based on its category (type). Our Species feature has seven categories: Perch, Bream, Roach, Pike, Smelt, Parkki, and Whitefish. So if our fish is a Pike, we would code its species as (0,0,0,1,0,0,0), and if it is a Bream we would code its species as (0,1,0,0,0,0,0). Of course this adds seven more dimensions to our feature space and seven more weights to train.
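The one-hot encoding described above can be sketched in a few lines of Python. The helper name `one_hot` is made up for illustration; the category ordering follows the order given in the text.

```python
import numpy as np

# Species categories in the order listed in the text
species = ["Perch", "Bream", "Roach", "Pike", "Smelt", "Parkki", "Whitefish"]

def one_hot(fish_species):
    """Return a vector of ones and zeros encoding the fish's category."""
    code = np.zeros(len(species), dtype=int)
    code[species.index(fish_species)] = 1
    return code

print(one_hot("Pike"))   # [0 0 0 1 0 0 0]
print(one_hot("Bream"))  # [0 1 0 0 0 0 0]
```

In practice a library routine such as `pandas.get_dummies` does the same job on a whole column at once, but the hand-rolled version makes the seven extra dimensions, and the seven extra weights to train, concrete.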

Figure 3-2. The first five rows of the Fish Market data set downloaded from Kaggle. The Weight column is the target feature; our goal is to build a model that predicts the weight of a new fish from its length measurements.

Let’s save ink space and relabel our five features as x_1, x_2, x_3, x_4, and x_5, then write the fish weight as a function of these five features, y = f(x_1, x_2, x_3, x_4, x_5). This way, once we settle on an acceptable formula for this function, all we have to do is input the feature values for a certain fish and our function will output the predicted weight of that fish.

This section builds a foundation for everything to come, so it is important to first see how it is organized:

Training function
  • Parametric models versus nonparametric models.

Loss function
  • The predicted value versus the true value.

  • The absolute value distance versus the squared distance.

  • Functions with singularities (pointy points).

  • For linear regression, the loss function is the mean squared error.

  • Vectors in this book are always column vectors.

  • The training, validation, and test subsets.

  • When the training data has highly correlated features.

Optimization
  • Convex landscapes versus nonconvex landscapes.

  • How do we locate minimizers of functions?

  • Calculus in a nutshell.

  • A one-dimensional optimization example.

  • Derivatives of linear algebra expressions that we use all the time.

  • Minimizing the mean squared error loss function.

  • Caution: multiplying large matrices by each other is very expensive—multiply matrices by vectors instead.

  • Caution: we never want to fit the training data too well.

Training Function

A quick exploration of the data, as in plotting the weight against the various length features, allows us to assume a linear model (even though a nonlinear one could be better in this case). That is, we assume that the weight depends linearly on the length features (see Figure 3-3).

This means that the weight of a fish, y, can be computed using a linear combination of its five different length measurements, plus a bias term ω_0, giving the following training function:

y = ω_0 + ω_1 x_1 + ω_2 x_2 + ω_3 x_3 + ω_4 x_4 + ω_5 x_5

After our major decision in the modeling process to use a linear training function f(x_1, x_2, x_3, x_4, x_5), all we have to do is find the appropriate values of the parameters ω_0, ω_1, ω_2, ω_3, ω_4, and ω_5. We will learn the best values for our ω's from the data. The process of using the data to find the appropriate ω's is called training the model. A trained model is then a model where the ω values have been decided on.

In general, training functions, whether linear or nonlinear, including those representing neural networks, have unknown parameters, ω's, that we need to learn from the given data. For linear models, each parameter gives each feature a certain weight in the prediction process. So if the value of ω_2 is larger than the value of ω_5, then the second feature plays a more important role than the fifth feature in our prediction, assuming that the second and fifth features have comparable scales. This is one of the reasons it is good to scale or normalize the data before training the model. If, on the other hand, the value ω_3 associated with the third feature dies, meaning becomes zero or negligible, then the third feature can be omitted from the data set, as it plays no role in our predictions. Therefore, learning our ω's from the data allows us to mathematically compute the contribution of each feature to our predictions (or the importance of feature combinations if some features were combined during the data preparation stage, before training). In other words, the models learn how the data features interact and how strong these interactions are. The moral is that through a trained learning function, we can quantify how features come together to produce both observed and yet-to-be observed results.
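Since comparing weight magnitudes only makes sense when features share comparable scales, a standardization step is often applied first. Here is a minimal sketch; the data values are made up for illustration:

```python
import numpy as np

# Two made-up columns on different scales (say, a length and a width)
X = np.array([[23.2, 4.02],
              [24.0, 4.31],
              [26.3, 4.69],
              [26.5, 5.13]])

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Each column now has mean ~0 and standard deviation ~1, so the trained
# ω's become directly comparable across features.
print(np.round(X_std.mean(axis=0), 12), np.round(X_std.std(axis=0), 12))
```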

Figure 3-3. Scatterplots of the numerical features of the Fish Market data set. For more details, see the book's GitHub page, or some of the public notebooks associated with this data set on Kaggle.

Loss Function

We have convinced ourselves that the next logical step is finding suitable values for the ω's that appear in the training function (of our linear parametric model), using the data that we have. To do that, we need to optimize an appropriate loss function.

The predicted value versus the true value

Suppose we assign some random numerical values for each of our unknown ω_0, ω_1, ω_2, ω_3, ω_4, and ω_5—say, for example, ω_0 = -3, ω_1 = 4, ω_2 = 0.2, ω_3 = 0.03, ω_4 = 0.4, and ω_5 = 0.5. Then the formula for the linear training function y = ω_0 + ω_1 x_1 + ω_2 x_2 + ω_3 x_3 + ω_4 x_4 + ω_5 x_5 becomes:

y = -3 + 4 x_1 + 0.2 x_2 + 0.03 x_3 + 0.4 x_4 + 0.5 x_5

and is ready to make predictions. Plug in numerical values for the length features of the i-th fish, then obtain a predicted value for the weight of this fish. For example, the first fish in our data set is a bream and has length measurements x_1^(1) = 23.2, x_2^(1) = 25.4, x_3^(1) = 30, x_4^(1) = 11.52, and x_5^(1) = 4.02. Plugging these into the training function, we get the prediction for the weight of this fish:

y_predict^(1) = ω_0 + ω_1 x_1^(1) + ω_2 x_2^(1) + ω_3 x_3^(1) + ω_4 x_4^(1) + ω_5 x_5^(1) = -3 + 4(23.2) + 0.2(25.4) + 0.03(30) + 0.4(11.52) + 0.5(4.02) = 102.398
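The arithmetic above is easy to check in code. A quick sketch, using the randomly chosen weights and the first fish's measurements from the text:

```python
import numpy as np

# Randomly chosen weights from the text:
# ω_0 = -3, ω_1 = 4, ω_2 = 0.2, ω_3 = 0.03, ω_4 = 0.4, ω_5 = 0.5
w0 = -3.0
w = np.array([4.0, 0.2, 0.03, 0.4, 0.5])

# Length measurements of the first fish (a bream) in the data set
x = np.array([23.2, 25.4, 30.0, 11.52, 4.02])

y_predict = w0 + w @ x      # evaluate the linear training function
print(round(y_predict, 3))  # 102.398
```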
In general, for the i-th fish, we have:

y_predict^(i) = ω_0 + ω_1 x_1^(i) + ω_2 x_2^(i) + ω_3 x_3^(i) + ω_4 x_4^(i) + ω_5 x_5^(i)

The fish under consideration, however, has a certain true weight, y_true^(i), which is its label if it belongs in the labeled data set. For the first fish in our data set, the true weight is y_true^(1) = 242 grams. Our linear model with randomly chosen ω values predicted 102.398 grams. This is of course pretty far off, since we did not calibrate the ω values at all. In any case, we can measure the error between the weight predicted by our model and the true weight, then find ways to do better in terms of our choices for the ω's.

The absolute value distance versus the squared distance

One of the nice things about mathematics is that it has multiple ways to measure how far things are from each other, using different distance metrics. For example, we can naively measure the distance between two quantities as being 1 if they are different, and 0 if they are the same, encoding the words: different-1, similar-0. Of course, using such a naive metric, we lose a ton of information, since the distance between quantities such as 2 and 10 will be equal to the distance between 2 and 1 million, namely 1.

There are some distance metrics that are popular in machine learning. We first introduce the two most commonly used:

  • The absolute value distance: |y_predict - y_true|, stemming from the calculus function |x|.

  • The squared distance: |y_predict - y_true|^2, stemming from the calculus function |x|^2 (which is the same as x^2 for scalar quantities). Of course, this will square the units as well.

Inspecting the graphs of the functions |x| and x^2 in Figure 3-4, we notice a great difference in function smoothness at the point (0,0). The function |x| has a corner at that point, rendering it undifferentiable at x = 0. This singularity of |x| at x = 0 turns many practitioners (and mathematicians!) away from incorporating this function, or functions with similar singularities, into their models. However, let’s engrave the following into our brains.

Mathematical models are flexible. When we encounter a hurdle we dig deeper, understand what’s going on, then we work around the hurdle.

Figure 3-4. Left: the graph of |x| has a corner at x = 0, where its derivative is undefined. Right: the graph of |x|^2 is smooth at x = 0, so its derivative has no problem there.

Other than the difference in the regularity of the functions |x| and |x|^2 (meaning whether they have derivatives at all points or not), there is one more point that we need to pay attention to before deciding to incorporate either function into our error formula: if a number is large, then its square is even larger. This simple observation means that if we decide to measure the error using squared distances between true values and predicted values, then our method will be more sensitive to the outliers in the data. One messed-up outlier might skew our whole prediction function toward it, and hence away from the more prevalent patterns in the data. Ideally, we would have taken care of outliers and decided whether we should keep them or not during the data preparation step, before feeding the data into any machine learning model.
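This sensitivity to outliers is easy to see numerically. A small sketch with made-up residuals, the last one an outlier:

```python
import numpy as np

# Made-up residuals (prediction minus truth); the last one is an outlier
errors = np.array([1.0, 2.0, 1.5, 100.0])

mean_abs = np.abs(errors).mean()  # absolute value distance, averaged
mean_sq = (errors ** 2).mean()    # squared distance, averaged

print(mean_abs)  # 26.125
print(mean_sq)   # 2501.8125
# The outlier contributes ~96% of the mean absolute error but ~99.9% of
# the mean squared error, dragging a squared-loss fit much harder toward it.
```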

One last difference between |x| (and similar piecewise linear functions) and x^2 (and similar nonlinear but differentiable functions) is that the derivative of |x| is very easy:

  • 1 if x > 0, –1 if x < 0 (and undefined if x = 0)
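A sketch of this cheap derivative, with function names made up for illustration:

```python
# The derivative of |x| never needs the value of x, only its sign
def d_abs(x):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    raise ValueError("the derivative of |x| is undefined at x = 0")

# By contrast, the derivative of x**2 must evaluate 2*x at every step
def d_square(x):
    return 2.0 * x

print(d_abs(3.7), d_abs(-0.001))  # 1.0 -1.0
```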

In a model that involves billions of computational steps, this property where there is no need to evaluate anything when using the derivative of |x| proves to be extremely valuable. Derivatives of functions that are neither linear nor piecewise linear usually involve evaluations (because they also have x’s in their formulas and not only constants, like in the piecewise linear case), which can be expensive in big data settings.

Functions with singularities

In general, graphs of differentiable functions do not have cusps, kinks, corners, or anything pointy. If they do have such singularities, then the function has no derivative at these points. The reason is that at a pointy point, you can draw two different tangent lines to the graph of the function, depending on whether you decide to draw the tangent line to the left or to the right of the point (see Figure 3-5). Recall that the derivative of a function at a point is the slope of the tangent line to the graph of the function at that point. If there are two different slopes, then we cannot define the derivative at the point.

Figure 3-5. At singular points, the derivative does not exist. There is more than one possible slope for the tangent line at these points.

This discontinuity in the slope of the tangent creates a problem for methods that rely on evaluating the derivative of the function, such as the gradient descent method. The problem here is twofold:

Undefined derivative

What derivative value should you use? If you happen to land at a quirky pointy point, then the method doesn’t know what to do, since there is no defined derivative there. Some people assign a value for the derivative at that point (called the subgradient or the subdifferential) and move on. In reality, what are the odds that we will be unlucky enough to land exactly at that one horrible point? Unless the landscape of the function looks like the rough terrain of the Alps (actually, many do), the numerical method might manage to avoid them.

Instability

The other problem is instability. Since the value of the derivative jumps so abruptly as you traverse the landscape of the function across this point, a method using this derivative will abruptly change value as well, creating instabilities if you are trying to converge somewhere. Imagine you are hiking down the Swiss Alps in Figure 3-6 (the landscape of the loss function) and your destination is that pretty little town that you can see down in the valley (the place with the lowest error value). Then suddenly you get carried away by some alien (the alien is the mathematical search method relying on this abruptly changing derivative) to the other side of the mountain, where you cannot see your destination anymore. In fact, now all you can see in the valley are some ugly shrubs and an extremely narrow canyon that can trap you if your method carries you there. Your convergence to your original destination is now unstable, if not totally lost.

Figure 3-6. Swiss Alps: optimization is similar to hiking the landscape of a function

Nevertheless, functions with such singularities are used all the time in machine learning. We will encounter them in the formulas of some neural network training functions (rectified linear unit function—who names these?), in some loss functions (absolute value distance), and in some regularizing terms (lasso regression—who names these too?).

For linear regression, the loss function is the mean squared error

Back to the main goal for this section: constructing an error function, also called the loss function, which encodes how much error our model commits when making its predictions, and must be made small.

For linear regression, we use the mean squared error function. This function averages over the squared distance errors between the prediction and the true value for m data points (we will mention which data points to include here shortly):

$\text{Mean Squared Error} = \frac{1}{m}\left( |y_{predict}^{1} - y_{true}^{1}|^2 + |y_{predict}^{2} - y_{true}^{2}|^2 + \cdots + |y_{predict}^{m} - y_{true}^{m}|^2 \right)$

Let’s write the above expression more compactly using the sum notation:

$\text{Mean Squared Error} = \frac{1}{m}\sum_{i=1}^{m} |y_{predict}^{i} - y_{true}^{i}|^2$

Now we get into the great habit of using the even more compact linear algebra notation of vectors and matrices. This habit proves extremely handy in the field, as we don’t want to drown while trying to keep track of indices. Indices can sneak up into our rosy dreams of understanding everything and quickly transform them into very scary nightmares. Another important reason to use the compact linear algebra notation is that both the software and the hardware built for machine learning models are optimized for matrix and tensor (think of an object made of layered matrices, like a three-dimensional box instead of a flat square) computations. Moreover, the beautiful field of numerical linear algebra has worked through many potential problems and paved the way for us to enjoy the fast methods to perform all kinds of matrix computations.

Using linear algebra notation, we can write the mean squared error as:

$\text{Mean Squared Error} = \frac{1}{m}(\vec{y}_{predict} - \vec{y}_{true})^t(\vec{y}_{predict} - \vec{y}_{true}) = \frac{1}{m}\|\vec{y}_{predict} - \vec{y}_{true}\|_{l^2}^2$

The last equality introduces the $l^2$ norm of a vector, whose square is by definition just the sum of the squares of its components.

Take-home idea: the loss function that we constructed encodes the difference between the predictions and the ground truths for the data points involved in the training process, measured in some norm (a mathematical entity that acts as a distance). There are many other norms that we could have used, but the $l^2$ norm is pretty popular.

Notation: Vectors in this book are always column vectors

To be consistent in notation throughout the book, all vectors are column vectors. So if a vector $\vec{v}$ has four components, the symbol $\vec{v}$ stands for the column vector $(v_1\ v_2\ v_3\ v_4)^t$.

The transpose of a vector $\vec{v}$ is, then, always a row vector. The transpose of the above vector with four components is $\vec{v}^t = (v_1\ v_2\ v_3\ v_4)$.

We might also use the dot product notation (also called the scalar product because we multiply two vectors but our answer is a scalar number). The dot product of two vectors $\vec{a} \cdot \vec{b}$ is the same thing as $\vec{a}^t\vec{b}$. In essence, $\vec{a}^t\vec{b}$ thinks of a column vector as a matrix of shape (length of the vector by 1), and of its transpose as a matrix of shape (1 by length of the vector).

Suppose now that a and b have four components, then:

$\vec{a}^t\vec{b} = (a_1\ a_2\ a_3\ a_4)\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} = a_1 b_1 + a_2 b_2 + a_3 b_3 + a_4 b_4 = \sum_{i=1}^{4} a_i b_i$

Moreover,

$\|\vec{a}\|_{l^2}^2 = \vec{a}^t\vec{a} = a_1^2 + a_2^2 + a_3^2 + a_4^2$

Similarly,

$\|\vec{b}\|_{l^2}^2 = \vec{b}^t\vec{b} = b_1^2 + b_2^2 + b_3^2 + b_4^2$

This way, we use matrix notation throughout, and we only put an arrow above a letter to indicate that we are dealing with a column vector.
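These identities are easy to check numerically. Here is a quick sketch using NumPy; the vector values are arbitrary example numbers:

```python
import numpy as np

# Two vectors with four components each (arbitrary example values)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([5.0, 6.0, 7.0, 8.0])

# The dot product a . b is the same as the matrix product a^t b
dot_ab = a @ b

# It agrees with the componentwise sum a_1*b_1 + a_2*b_2 + a_3*b_3 + a_4*b_4
sum_ab = sum(a_i * b_i for a_i, b_i in zip(a, b))

# The squared l2 norm of a vector is the vector dotted with itself
norm_a_squared = a @ a
```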

The training, validation, and test subsets

Which data points do we include in our loss function? Do we include the whole data set, some small batches of it, or even only one point? Are we measuring this mean squared error for the data points in the training subset, the validation subset, or the test subset? And what are these subsets anyway?

In practice, we split a data set into three subsets:

Training subset

This is the subset of the data that we use to fit our training function. This means that the data points in this subset are the ones that get incorporated into our loss function (by plugging their feature values and label into the $y_{predict}$ and the $y_{true}$ of the loss function).

Validation subset

The data points in this subset are used in multiple ways:

  • The common description is that we use this subset to tune the hyperparameters of the machine learning model. The hyperparameters are any parameters in the machine learning model that are not the ω ’s of the training function that we are trying to solve for. In machine learning, there are many of these, and their values affect the results and the performance of the model. Examples of hyperparameters include (you don’t have to know what these are yet): the learning rate that appears in the gradient descent method; the hyperparameter that determines the width of the margin in support vector machine methods; the percentage of original data split into training, validation, and test subsets; the batch size when doing randomized batch gradient descent; weight decay hyperparameters such as those used in ridge, lasso, and elastic net regression; hyperparameters that come with momentum methods such as gradient descent with momentum and ADAM (these have terms that accelerate the convergence of the method toward the minimum, and these terms are multiplied by hyperparameters that need to be tuned before testing and deployment); the number of epochs during the optimization process (the number of passes over the entire training subset that the optimizer has seen); and the architecture of a neural network (such as the number of layers, the width of each layer, etc.).

  • The validation subset also helps us know when to stop optimizing before overfitting our training subset.

  • It also serves as a test set to compare the performance of different machine learning models on the same data set, for example, comparing the performance of a linear regression model to a random forest to a neural network.

Test subset

After deciding on the best model to use (or averaging or aggregating the results of multiple models) and training the model, we use this untouched subset of the data as a last-stage test for our model before deployment into the real world. Since the model has not seen any of the data points in this subset before (which means it has not included any of them in the optimization process), it can be considered as the closest analog to a real-world situation. This allows us to judge the performance of our model before we start applying it to completely new real-world data.
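The three-way split itself can be sketched by hand in a few lines. Below is a minimal pure-NumPy version with an assumed 80/10/10 split on a synthetic data set (libraries such as scikit-learn also provide helpers like train_test_split for this purpose):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# A toy data set: 100 points, 5 features each, plus a label vector
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Shuffle the indices, then carve out 80% / 10% / 10% (assumed split ratios)
idx = rng.permutation(len(X))
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Shuffling before splitting matters: if the data is ordered (say, by date or by label), a straight slice would hand the model a biased training subset.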

Recap

Let’s recap a little before moving forward:

  • Our current machine learning model is called linear regression.

  • Our training function is linear with the formula:

    $y = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3 + \omega_4 x_4 + \omega_5 x_5$

    The x ’s are the features, and the ω ’s are the unknown weights or parameters.

  • If we plug in the feature values of a particular data point (for example, the tenth data point) into the formula of the training function, we get our model’s prediction for this point:

    $y_{predict}^{10} = \omega_0 + \omega_1 x_1^{10} + \omega_2 x_2^{10} + \omega_3 x_3^{10} + \omega_4 x_4^{10} + \omega_5 x_5^{10}$

    The superscript 10 indicates that these are values corresponding to the tenth data point.

  • Our loss function is the mean squared error function with the formula:

    $\text{Mean Squared Error} = \frac{1}{m}(\vec{y}_{predict} - \vec{y}_{true})^t(\vec{y}_{predict} - \vec{y}_{true}) = \frac{1}{m}\|\vec{y}_{predict} - \vec{y}_{true}\|_{l^2}^2$

  • We want to find the values of the ω ’s that minimize this loss function. So the next step must be solving a minimization (optimization) problem.

To make our optimization life much easier, we will once again employ the convenient notation of linear algebra (vectors and matrices). This allows us to include the entire training subset of the data as a matrix in the formula of the loss function, and do our computations immediately on the training subset, as opposed to computing on each data point separately. This little notation maneuver saves us from a lot of mistakes, pain, and tedious calculations with many components that are difficult to keep track of on very large data sets.

First, write the prediction of our model corresponding to each data point of the training subset:

$y_{predict}^{1} = 1\omega_0 + \omega_1 x_1^{1} + \omega_2 x_2^{1} + \omega_3 x_3^{1} + \omega_4 x_4^{1} + \omega_5 x_5^{1}$
$y_{predict}^{2} = 1\omega_0 + \omega_1 x_1^{2} + \omega_2 x_2^{2} + \omega_3 x_3^{2} + \omega_4 x_4^{2} + \omega_5 x_5^{2}$
$\vdots$
$y_{predict}^{m} = 1\omega_0 + \omega_1 x_1^{m} + \omega_2 x_2^{m} + \omega_3 x_3^{m} + \omega_4 x_4^{m} + \omega_5 x_5^{m}$

We can easily arrange this system as:

$\begin{pmatrix} y_{predict}^{1} \\ y_{predict}^{2} \\ \vdots \\ y_{predict}^{m} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\omega_0 + \begin{pmatrix} x_1^{1} \\ x_1^{2} \\ \vdots \\ x_1^{m} \end{pmatrix}\omega_1 + \begin{pmatrix} x_2^{1} \\ x_2^{2} \\ \vdots \\ x_2^{m} \end{pmatrix}\omega_2 + \begin{pmatrix} x_3^{1} \\ x_3^{2} \\ \vdots \\ x_3^{m} \end{pmatrix}\omega_3 + \begin{pmatrix} x_4^{1} \\ x_4^{2} \\ \vdots \\ x_4^{m} \end{pmatrix}\omega_4 + \begin{pmatrix} x_5^{1} \\ x_5^{2} \\ \vdots \\ x_5^{m} \end{pmatrix}\omega_5$

Or even better:

$\begin{pmatrix} y_{predict}^{1} \\ y_{predict}^{2} \\ \vdots \\ y_{predict}^{m} \end{pmatrix} = \begin{pmatrix} 1 & x_1^{1} & x_2^{1} & x_3^{1} & x_4^{1} & x_5^{1} \\ 1 & x_1^{2} & x_2^{2} & x_3^{2} & x_4^{2} & x_5^{2} \\ \vdots & & & & & \vdots \\ 1 & x_1^{m} & x_2^{m} & x_3^{m} & x_4^{m} & x_5^{m} \end{pmatrix} \begin{pmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \\ \omega_3 \\ \omega_4 \\ \omega_5 \end{pmatrix}$

The vector on the lefthand side of that equation is $\vec{y}_{predict}$, the matrix on the righthand side is the training subset X augmented with the vector of ones, and the last vector on the righthand side has all the unknown weights packed neatly into it. Call this vector $\vec{\omega}$, then write $\vec{y}_{predict}$ compactly in terms of the training subset and $\vec{\omega}$ as:

$\vec{y}_{predict} = X\vec{\omega}$

Now the formula of the mean squared error loss function, which we wrote before as:

$\text{Mean Squared Error} = \frac{1}{m}(\vec{y}_{predict} - \vec{y}_{true})^t(\vec{y}_{predict} - \vec{y}_{true}) = \frac{1}{m}\|\vec{y}_{predict} - \vec{y}_{true}\|_{l^2}^2$

becomes:

$\text{Mean Squared Error} = \frac{1}{m}(X\vec{\omega} - \vec{y}_{true})^t(X\vec{\omega} - \vec{y}_{true}) = \frac{1}{m}\|X\vec{\omega} - \vec{y}_{true}\|_{l^2}^2$

We are now ready to find the ω that minimizes the neatly written loss function. For that, we have to visit the rich and beautiful mathematical field of optimization.
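Before moving on, the compact notation is easy to verify numerically. Here is a small sketch that builds the augmented matrix X from a toy training subset and evaluates the loss for an arbitrary weight vector (all numbers are made up for illustration):

```python
import numpy as np

# Toy training subset: m = 4 data points, 5 features each (made-up values)
X_raw = np.arange(20.0).reshape(4, 5)
y_true = np.array([3.0, 7.0, 11.0, 15.0])

# Augment with a column of ones so that omega_0 plays the role of the bias
X = np.column_stack([np.ones(len(X_raw)), X_raw])   # shape (4, 6)

# An arbitrary weight vector omega = (omega_0, omega_1, ..., omega_5)
omega = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])

# y_predict = X omega, and the loss is (1/m) * ||X omega - y_true||^2
y_predict = X @ omega
mse = np.sum((y_predict - y_true) ** 2) / len(X)
```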

When the training data has highly correlated features

Inspecting the training matrix (augmented with the vector of ones):

$X = \begin{pmatrix} 1 & x_1^{1} & x_2^{1} & x_3^{1} & x_4^{1} & x_5^{1} \\ 1 & x_1^{2} & x_2^{2} & x_3^{2} & x_4^{2} & x_5^{2} \\ \vdots & & & & & \vdots \\ 1 & x_1^{m} & x_2^{m} & x_3^{m} & x_4^{m} & x_5^{m} \end{pmatrix}$

that appears in the vector $\vec{y}_{predict} = X\vec{\omega}$, the formula of the mean squared error loss function, and later the formula that determines the unknown $\vec{\omega}$ (also called the normal equation):

$\vec{\omega} = (X^t X)^{-1} X^t \vec{y}_{true}$

we can see how our model might have a problem if two or more features (x columns) of the data are highly correlated. This means that there is a strong linear relationship between the features, so one of these features can be determined (or nearly determined) using a linear combination of the others. Thus, the corresponding feature columns are not linearly independent (or close to not being linearly independent). For matrices, this is a problem, since it indicates that the matrix either cannot be inverted or is ill conditioned. Ill-conditioned matrices produce large instabilities in computations, since slight variations in the training data (which must be assumed) produce large variations in the model parameters and hence render its predictions unreliable.

We desire well-conditioned matrices in our computations, so we must get rid of the sources of ill conditioning. When we have highly correlated features, one possible avenue is to include only one of them in our model, as the others do not add much information. Another solution is to apply dimension reduction techniques such as principal component analysis, which we will encounter in Chapter 11. The Fish Market data set has highly correlated features, and the accompanying Jupyter Notebook addresses those.
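One quick way to spot this issue is to inspect the pairwise correlations of the feature columns, or the condition number of $X^t X$. A sketch with a deliberately redundant feature (all data here is synthetic; the threshold of what counts as "huge" is problem-dependent):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=1e-6, size=200)  # nearly a multiple of x1
x3 = rng.normal(size=200)                          # an independent feature

X = np.column_stack([x1, x2, x3])

# Pairwise correlation matrix of the feature columns
corr = np.corrcoef(X, rowvar=False)

# Condition number of X^t X: huge values signal ill conditioning
cond = np.linalg.cond(X.T @ X)
```

Here corr[0, 1] is essentially 1 (x2 carries almost no information beyond x1), and the condition number of $X^t X$ blows up, which is exactly the instability the text warns about.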

That said, it is important to note that some machine learning models, such as decision trees and random forests (discussed soon), are not affected by correlated features, while others, such as the current linear regression model and the upcoming logistic regression and support vector machine models, are negatively affected by them. As for neural network models, even though they can learn the correlations involved in the data features during training, they perform better when these redundancies are taken care of ahead of time, which also saves computation cost and time.

Optimization

Optimization means finding the optimal, best, maximal, minimal, or extreme solution.

We wrote a linear training function:

$y = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3 + \omega_4 x_4 + \omega_5 x_5$

and we left the values of its six parameters $\omega_0$, $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$, and $\omega_5$ unknown. The goal is to find the values that make our training function best fit the training data subset, where the word best is quantified using the loss function. This function provides a measure of how far the prediction made by the model’s training function is from the ground truth. We want this loss function to be small, so we solve a minimization problem.

We are not going to sit there and try out every possible ω value until we find the combination that gives the least loss. Even if we did, we wouldn’t know when to stop, since we wouldn’t know whether there are other better values. We must have prior knowledge about the landscape of the loss function and take advantage of its mathematical properties. The analogy is hiking down the Swiss Alps with a blindfold versus hiking with no blindfold and a detailed map (Figure 3-7 shows the rough terrain of the Swiss Alps). Instead of searching the landscape of the loss function for minimizers with a blindfold, we tap into the field of optimization. Optimization is a beautiful branch of mathematics that provides various methods to efficiently search for and locate optimizers of functions and their corresponding optimal values.

The optimization problem in this chapter and in the next few looks like:

$\min_{\vec{\omega}} \text{Loss Function}$
Figure 3-7. Swiss Alps: optimization is similar to hiking the landscape of a function. The destination is either the bottom of the lowest valley (minimization) or the top of the highest peak (maximization). We need two things: the coordinates of the minimizing or maximizing points, and the height of the landscape at those points.

For the current linear regression model, this is:

$\min_{\vec{\omega}} \frac{1}{m}(X\vec{\omega} - \vec{y}_{true})^t(X\vec{\omega} - \vec{y}_{true}) = \min_{\vec{\omega}} \frac{1}{m}\|X\vec{\omega} - \vec{y}_{true}\|_{l^2}^2$

When we do math, we should never lose track of what it is that we know and what it is that we are looking for. Otherwise we would run the risk of getting trapped in a circular logic. In the formula just mentioned, we know:

m

The number of instances in the training subset

X

The training subset augmented with a vector of ones

$\vec{y}_{true}$

The vector of labels corresponding to the training subset

And we are looking for:

  • The minimizing ω

  • The minimum value of the loss function at the minimizing ω
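For this particular loss function, the minimizing $\vec{\omega}$ can be computed directly from the normal equation shown earlier. A sketch on synthetic data (in practice, np.linalg.lstsq is usually preferred over forming the inverse explicitly, for exactly the conditioning reasons discussed above):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Synthetic training subset: m = 50 points, 5 features, known true weights
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=(m, 5))])
omega_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])
y_true = X @ omega_true            # noiseless labels, so we can check exactly

# Normal equation: omega = (X^t X)^{-1} X^t y_true,
# solved as a linear system rather than by inverting X^t X
omega = np.linalg.solve(X.T @ X, X.T @ y_true)
```

Because the labels here are noiseless, the recovered omega matches omega_true; with noisy labels, it would instead be the best fit in the mean squared error sense.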

Convex landscapes versus nonconvex landscapes

The easiest functions to deal with and the easiest equations to solve are linear. Unfortunately, most of the functions (and equations) that we deal with are nonlinear. At the same time, this is not too unfortunate since linear life is flat, boring, unstimulating, and uneventful. When the function we have at hand is full-blown nonlinear, we sometimes linearize it near certain points that we care for. The idea here is that even though the full picture of the function might be nonlinear, we may be able to approximate it with a linear function in the locality that we focus on. In other words, in a very small neighborhood, the nonlinear function might look and behave linearly, albeit the said neighborhood might be infinitesimally small. For an analogy, think about how Earth looks (and behaves in terms of calculating distances, etc.) flatly from our own locality, and how we can only see its nonlinear shape from high up. When we want to linearize a function near a point, we approximate it by its tangent space near that point (this is its tangent line if it is a function of one variable, tangent plane if it is a function of two variables, and tangent hyperplane if it is a function of three or more variables). For this, we need to calculate one derivative of the function with respect to all of its variables, since this gives us the slope (which measures the inclination) of the approximating flat space.

The sad news is that linearizing near one point may not be enough, and we may want to use linear approximations at multiple locations. Thankfully, that is doable, since all we have to do computationally is evaluate one derivative at several points. This leads us to the next easiest functions to deal with (after linear functions): piecewise linear functions, which are linear but only in piecewise structures, or linear except at isolated points or locations. The field of linear programming deals with such functions, where the functions to optimize are linear, and the boundaries of the domains where the optimization happens are piecewise linear (they are intersections of half spaces).

When our goal is optimization, the best functions to deal with are either linear (where the field of linear programming helps us) or convex (where we do not worry about getting stuck at local minima, and where we have good inequalities that help us with the analysis).

One important type of function to keep in mind, which appears in machine learning, is a function that is the maximum of two or more convex functions. These functions are always convex. Recall that linear functions are flat, so they are at the same time convex and concave. This is useful since some functions are defined as the maxima of linear functions: those are not guaranteed to be linear (they are piecewise linear), but are guaranteed to be convex. That is, even though we lose linearity when we take the maximum of linear functions, we are compensated with convexity.

The Rectified Linear Unit function (ReLU) that is used as a nonlinear activation function in neural networks is an example of a function defined as the maximum of two linear functions: $ReLU(x) = \max(0, x)$. Another example is the hinge loss function used for support vector machines: $H(x) = \max(0, 1 - tx)$, where t is either 1 or –1.
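Both functions take only a couple of lines to sketch; note how each is the pointwise maximum of two linear functions, which is exactly what guarantees their convexity:

```python
import numpy as np

def relu(x):
    # Pointwise maximum of the two linear functions 0 and x
    return np.maximum(0.0, x)

def hinge(x, t):
    # Pointwise maximum of the two linear functions 0 and 1 - t*x,
    # with t either 1 or -1 (the true label in the SVM setting)
    return np.maximum(0.0, 1.0 - t * x)
```

A quick sanity check on convexity: the value at a midpoint never exceeds the average of the values at the endpoints.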

Note that the minimum of a family of convex functions is not guaranteed to be convex; it can have a double well. However, their maximum is definitely convex.

There is one more relationship between linearity and convexity. If we have a convex function (nonlinear since linear would be trivial), then the maximum of all the linear functions that stay below our function is exactly equal to it. In other words, convexity replaces linearity, in the sense that when linearity is not available, but convexity is, we can replace our convex function with the maximum of all the linear functions whose graph lies below our function’s graph (see Figure 3-8). Recall that the graph of a convex function lies above the graph of its tangent at any point, and that the tangents are linear. This gives us a direct path to exploiting the simplicity of linear functions when we have convex functions. We have equality when we consider the maximum of all the tangents, and only approximation when we consider the maximum of the tangents at a few points.

Figure 3-8. A convex function is equal to the maximum of all of its tangent lines

3-93-10分别显示了非线性凸函数和非凸函数的一般情况。总体而言,凸函数的景观有利于最小化问题。我们不用担心陷入局部最小值,因为任何局部最小值也是凸函数的全局最小值。非凸函数的景观有峰、谷和鞍点。这种情况下的最小化问题存在陷入局部最小值并且永远找不到全局最小值的风险。

Figures 3-9 and 3-10 show the general landscapes of nonlinear convex and nonconvex functions, respectively. Overall, the landscape of a convex function is good for minimization problems. We have no fear of getting stuck at local minima since any local minimum is also a global minimum for a convex function. The landscape of a nonconvex function has peaks, valleys, and saddle points. A minimization problem on such a landscape runs the risk of getting stuck at the local minima and never finding the global minima.

Finally, make sure you know the distinction among a convex function, a convex set, and a convex optimization problem, which optimizes a convex function over a convex set.

Figure 3-9. The landscape of a convex function is good for minimization problems. We have no fear of getting stuck at local minima since any local minimum is also a global minimum for a convex function.

Figure 3-10. The landscape of a nonconvex function has peaks, valleys, and saddle points. A minimization problem on such a landscape runs the risk of getting stuck at local minima and never finding the global minimum.

How do we locate minimizers of functions?

In general, there are two approaches to locating minimizers (and/or maximizers) of functions. The trade-off is usually between:

  1. Calculating only one derivative and converging to the minimum slowly (though there are acceleration methods to speed up the convergence). These are called gradient methods. The gradient is one derivative of a function of several variables. For example, our loss function is a function of several $\omega$’s (or of one vector $\vec{\omega}$).

  2. Calculating two derivatives (computationally much more expensive, which is a big turnoff, especially when we have thousands of parameters) and converging to the minimum faster. Computation costs can be saved a little by approximating the second derivative instead of computing it exactly. Second derivative methods are called Newton’s methods. The Hessian (the matrix of second derivatives) or an approximation of the Hessian appears in these methods.

We never need to go beyond calculating two derivatives.

But why are the first and second derivatives of a function so important for locating its optimizers? The concise answer is that the first derivative contains information on how fast a function increases or decreases at a point (so if you follow its direction, you might ascend to the maximum or descend to the minimum), and the second derivative contains information on the shape of the bowl of the function—if it curves up or curves down.

One key idea from calculus remains fundamental: minimizers (and/or maximizers) happen at critical points (defined as the points where one derivative of our function is either equal to zero or does not exist) or at boundary points. So to locate these optimizers, we must search through both the boundary points (if our search space has a boundary) and the interior critical points.

How do we locate the critical points in the interior of our search space?

Approach 1

We follow these steps:

  • Find one derivative of our function (not too bad, we all did it in calculus).

  • Set it equal to zero (we can all write the symbols equal and zero).

  • Solve for the ω ’s that make our derivative zero (this is the bad step!).

For functions whose derivatives are linear, such as our mean squared error loss function, it is sort of easy to solve for these ω ’s. The field of linear algebra was especially built to help solve linear systems of equations. The field of numerical linear algebra was built to help solve realistic and large systems of linear equations where ill conditioning is prevalent. We have many tools at our disposal (and software packages) when our systems are linear.

On the other hand, when our equations are nonlinear, finding solutions is an entirely different story. It becomes a hit-or-miss game, with mostly misses! Here’s a short example that illustrates the difference between solving a linear and a nonlinear equation:

Solving a linear equation

Find $\omega$ such that $0.002\omega - 5 = 0$.

Solution: Moving the 5 over to the other side, then dividing by 0.002, we get $\omega = 5/0.002 = 2500$. Done.

Solving a nonlinear equation

Find $\omega$ such that $0.002\sin(\omega) - 5\omega^2 + e^{\omega} = 0$.

Solution: Yes, I am out of here. We need a numerical method! (See Figure 3-11 for a graphical approximation of the solution of this nonlinear equation.)

Figure 3-11. Solving nonlinear equations is hard. Here, we graph $f(\omega) = 0.002\sin(\omega) - 5\omega^2 + e^{\omega}$ and approximate its three roots (the points where $f(\omega) = 0$) on the graph.

There are many numerical techniques devoted solely to finding solutions of nonlinear equations (and entire fields devoted to numerically solving nonlinear ordinary and partial differential equations). These methods find approximate solutions, then provide bounds on how far off the numerical solutions are from the exact analytical solutions. They usually construct a sequence that converges to the analytical solution under certain conditions. Some methods converge faster than others, and are better suited for certain problems than others.
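The simplest of these techniques is bisection: bracket a sign change of the function, then repeatedly halve the interval until the root is pinned down. Here is a sketch applied to the nonlinear equation above; the three bracketing intervals are assumptions read off a plot like Figure 3-11:

```python
import math

def f(w):
    # The nonlinear function from the example above
    return 0.002 * math.sin(w) - 5 * w**2 + math.exp(w)

def bisect(f, lo, hi, tol=1e-10):
    """Halve an interval containing a sign change until its width is below tol."""
    assert f(lo) * f(hi) < 0, "the bracket must contain a sign change"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid    # the sign change is in the left half
        else:
            lo = mid    # the sign change is in the right half
    return 0.5 * (lo + hi)

# The three roots of f, each bracketed by an interval where f changes sign
roots = [bisect(f, -1.0, 0.0), bisect(f, 0.0, 1.0), bisect(f, 4.0, 5.0)]
```

Bisection converges slowly (it gains one binary digit per iteration) but is extremely robust; faster methods like Newton's trade that robustness for speed.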

Approach 2

Another option is to follow the gradient direction to descend toward the minimum or ascend toward the maximum.

To understand these gradient-type methods, think of hiking down a mountain (or skiing down the mountain if the method is accelerated or has momentum). We start at a random point in our search space, and that sets us at an initial height level on the landscape of the function. Now the method moves us to a new point in the search space, and hopefully, at this new location, we end up at a new height level that is lower than the height level we came from. Hence, we would have descended. We repeat this and ideally, if the terrain of the function cooperates, this sequence of points will converge toward the minimizer of the function that we are looking for.

Of course, for functions with landscapes that have many peaks and valleys, where we start (in other words, how we initialize) matters, since we could descend down an entirely different valley than the one where we want to end up. We might end up at a local minimum instead of a global minimum.

Functions that are convex and bounded below are shaped like a salad bowl, so with those we do not worry about getting stuck at local minima and away from global minima. There could be another source of worry with convex functions: when the shape of the bowl of the function is too narrow, our method might become painfully slow to converge. We will go over this in detail in Chapter 4.


Both Approach 1 and Approach 2 are useful and popular. Sometimes, we have no option but to use one or the other, depending on how fast each method converges for our particular setting, how regular the function we are trying to optimize is (how many well-behaved derivatives it has), etc. Other times, it is just a matter of taste. For linear regression’s mean squared error loss function, both types of methods work, so we will use Approach 1, only because we will use gradient descent methods for all the other loss functions in this book.


We must mention that the hiking-down-the-mountain analogy for descent methods is excellent but a tad bit misleading. When we humans hike down a mountain, we physically belong in the same three-dimensional space that our mountain landscape exists in, meaning we are at a certain elevation and we are able to descend to a location at a lower elevation, even with a blindfold, and even when it is too foggy and we can only descend one tiny step at a time. We sense the elevation, then move downhill. Numerical descent methods, on the other hand, do not search for the minimum in the same space dimension as the one the landscape of the function is embedded in. Instead, they search on the ground level, one dimension below the landscape (see Figure 3-12). This makes descending toward the minimum much more difficult, since at ground level we can move from any point to any other without knowing what height level exists above us, until we evaluate the function itself at the point and find the height. So our method might accidentally move us from one ground point with a certain elevation above us to another ground point with a higher elevation, hence farther away from the minimum. This is why it is important to locate, on ground level, a direction that quickly decreases the function height, and how far we can move in that direction on ground level (step size) while still decreasing the function height above us. The step size is also called the learning rate hyperparameter, which we will encounter every time we use a descent method.
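The step-size idea can be sketched in a few lines of Python. This is a minimal, hypothetical example on a simple convex function (not code from the book's notebooks): each iteration takes a step of size equal to the learning rate against the gradient.

```python
# A minimal gradient descent sketch on the convex function f(w) = (w - 3)**2 + 1,
# whose exact minimizer is w = 3. The step size is the learning rate discussed above.

def f(w):
    return (w - 3) ** 2 + 1

def grad_f(w):
    return 2 * (w - 3)  # derivative of f

w = -5.0             # initialization: where we "start" on the function's landscape
learning_rate = 0.1  # the step size hyperparameter
for _ in range(200):
    w = w - learning_rate * grad_f(w)  # step against the gradient: downhill

print(round(w, 6))  # prints 3.0
```

With a learning rate that is too large (here, anything above 1.0), the same loop overshoots and diverges instead of descending, which is exactly the concern raised above.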


Back to our main goal: we want to find the best ω for our training function, so we must minimize our mean squared error loss function using Approach 1: take one derivative of the loss function and set it equal to zero, then solve for the vector ω . For this, we need to master doing calculus on linear algebra expressions. Let’s revisit our calculus class first.


Calculus in a nutshell


In a first course on calculus, we learn about functions of single variables ( f ( ω ) ), their graphs, and how to evaluate them at certain points. Then we learn about the most important operation in mathematical analysis: the limit. From the limit concept, we define continuity and discontinuity of functions, derivative at a point f ' ( ω ) (limit of slopes of secants through a point), and integral over a domain (limit of sums of mini regions determined by the function over the domain). We end the class with the fundamental theorem of calculus, relating integration and differentiation as inverse operations. One of the key properties of the derivative is that it determines how fast a function increases or decreases at a certain point; hence, it plays a crucial role in locating a minimum and/or maximum of a function in the interior of its domain (boundary points are separate).

In a multivariable calculus course, which is usually a third course in calculus, many ideas transfer from single-variable calculus, including the importance of the derivative, now called the gradient because we have several variables, in locating any interior minima and/or maxima. The gradient ∇f(ω) of f(ω) is the derivative of the function with respect to the vector ω of variables.


In deep learning, the unknown weights are organized in matrices, not in vectors, so we need to take the derivative of a function f ( W ) with respect to a matrix W of variables.


For our purposes in AI, the function whose derivative we need to calculate is the loss function, which has the training function built into it. By the chain rule for derivatives, we would also need to calculate the derivative of the training function with respect to the ω ’s.


Let’s demonstrate using one simple example from single-variable calculus, then immediately transition to taking derivatives of linear algebra expressions.


A one-dimensional optimization example

Problem: Find the minimizer(s) and the minimum value (if any) of the function f(ω) = 3 + (0.5ω - 2)^2 on the interval [-1,6].


One impossibly long way to go about this is to try out the infinitely many values of ω between –1 and 6, and choose the ω ’s that give the lowest f value. Another way is to use our calculus knowledge that optimizers (minimizers and/or maximizers) happen either at critical points (where the derivative is either nonexistent or zero) or at boundary points. For reference, see Figure 3-13.

Figure 3-13. The minimum value of the function f(ω) = 3 + (0.5ω - 2)^2 on the interval [-1,6] is 3, and it occurs at the critical point ω = 4. At this critical point the derivative of the function is zero, meaning that if we drew a tangent line there, it would be horizontal.

Our boundary points are -1 and 6, so we evaluate our function at these points first: f(-1) = 3 + (0.5(-1) - 2)^2 = 9.25 and f(6) = 3 + (0.5(6) - 2)^2 = 4. Obviously, -1 is not a minimizer since f(6) < f(-1), so this boundary point is out of the competition, and now only the boundary point 6 is competing with the interior critical point(s). In order to find our critical points, we inspect the derivative of the function in the interior of the interval [-1,6]: f'(ω) = 0 + 2(0.5ω - 2)(0.5) = 0.5ω - 2. Setting this derivative equal to zero, we have 0.5ω - 2 = 0, implying that ω = 4. Thus, we found only one critical point, ω = 4, in the interior of the interval [-1,6]. At this special point, the value of the function is f(4) = 3 + (0.5(4) - 2)^2 = 3. Since the value of f is the lowest here, we have found the winner of our minimization competition: the minimizer ω = 4, with the minimum f value equal to 3.
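The recipe above (compare the function's values at the boundary points and at the interior critical points, then pick the smallest) can be checked numerically with a short sketch using the same function and points as the worked example:

```python
# Numerically confirming the worked example: minimize f(w) = 3 + (0.5*w - 2)**2
# on [-1, 6] by comparing boundary points with the interior critical point.

def f(w):
    return 3 + (0.5 * w - 2) ** 2

boundary_points = [-1.0, 6.0]
critical_points = [4.0]   # where f'(w) = 0.5*w - 2 vanishes

minimizer = min(boundary_points + critical_points, key=f)
print(minimizer, f(minimizer))  # prints 4.0 3.0
```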


Derivatives of linear algebra expressions that we use all the time


It is efficient to calculate derivatives directly on expressions involving vectors and matrices, without having to resolve them into their components. The following two are popular:

  1. When a and ω are scalars and a is constant, the derivative of f(ω) = aω is f'(ω) = a. When a and ω are vectors (of the same length) and the entries of a are constant, then the gradient of f(ω) = a^t ω is ∇f(ω) = a. Similarly, the gradient of f(ω) = ω^t a is ∇f(ω) = a.

  2. When s is scalar and constant and ω is scalar, then the derivative of the quadratic function f(ω) = sω^2 is f'(ω) = 2sω. The analogous high-dimensional case is when S is a symmetric matrix with constant entries; then the function f(ω) = ω^t Sω is quadratic and its gradient is ∇f(ω) = 2Sω.
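As a sanity check, both rules can be verified with centered finite differences. This is a hypothetical sketch (not from the book) using NumPy, with randomly generated a, S, and ω:

```python
# Finite-difference check of the two gradient rules: the gradient of a^t w is a,
# and the gradient of w^t S w is 2 S w when S is symmetric with constant entries.
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
M = rng.standard_normal((n, n))
S = (M + M.T) / 2          # symmetrize S so the 2*S*w rule applies
w = rng.standard_normal(n)

def numerical_gradient(f, w, h=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (f(w + step) - f(w - step)) / (2 * h)  # centered difference
    return g

grad_linear = numerical_gradient(lambda v: a @ v, w)
grad_quadratic = numerical_gradient(lambda v: v @ S @ v, w)

print(np.allclose(grad_linear, a, atol=1e-5))             # True
print(np.allclose(grad_quadratic, 2 * S @ w, atol=1e-5))  # True
```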


Minimizing the mean squared error loss function


We are finally ready to minimize the mean squared error loss function:

L(ω) = (1/m) (Xω - y_true)^t (Xω - y_true)


Let’s open that expression up before taking its gradient and setting it equal to zero:

L(ω) = (1/m) ((Xω)^t - y_true^t)(Xω - y_true) = (1/m) (ω^t X^t - y_true^t)(Xω - y_true) = (1/m) (ω^t X^t Xω - ω^t X^t y_true - y_true^t Xω + y_true^t y_true) = (1/m) (ω^t Sω - ω^t a - a^t ω + y_true^t y_true),

where in the last step we set X^t X = S and X^t y_true = a. Next, take the gradient of the last expression with respect to ω and set it equal to zero. When calculating the gradient, we use what we just learned about differentiating linear algebra expressions:

∇L(ω) = (1/m)(2Sω - a - a + 0) = 0


Now it is easy to solve for ω :

(1/m)(2Sω - 2a) = 0


so:

2Sω = 2a


which gives:

ω = S^(-1) a


Now recall that we set S = X^t X and a = X^t y_true, so let's rewrite our minimizing ω in terms of the training set X (augmented with ones) and the corresponding labels vector y_true:

ω = (X^t X)^(-1) X^t y_true


For the Fish Market data set, this ends up being (see the accompanying Jupyter notebook):

ω = (ω_0, ω_1, ω_2, ω_3, ω_4, ω_5)^t = (-475.19929130, 82.84970118, -28.85952426, -28.50769512, 29.82981435, 30.97250278)^t


Multiplying Large Matrices by Each Other Is Very Expensive; Multiply Matrices by Vectors Instead

Try to avoid multiplying matrices by each other at all costs; instead, multiply your matrices with vectors. For example, in the normal equation ω = (X^t X)^(-1) X^t y_true, compute X^t y_true first, and avoid computing (X^t X)^(-1) altogether. The way around this is to instead solve the linear system Xω = y_true using the pseudoinverse of X (check the accompanying Jupyter notebook). We will discuss the pseudoinverse in Chapter 11, but for now, it allows us to invert (which is equivalent to dividing by) matrices that do not have an inverse.
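To see the tip in action, here is a small sketch on synthetic data (the data, seed, and sizes are made up for illustration; this is not the Fish Market notebook): the least-squares and pseudoinverse routes agree with the normal-equation answer without our ever forming (X^t X)^(-1) by hand.

```python
# Three ways to get the minimizing weights; lstsq/pinv avoid forming (X^t X)^(-1).
import numpy as np

rng = np.random.default_rng(1)
m = 50
X = np.column_stack([np.ones(m), rng.standard_normal((m, 3))])  # augmented with ones
w_star = np.array([2.0, -1.0, 0.5, 3.0])                        # "ground truth" weights
y_true = X @ w_star + 0.01 * rng.standard_normal(m)             # noisy labels

w_normal = np.linalg.inv(X.T @ X) @ (X.T @ y_true)    # normal equation (avoid for large X)
w_lstsq, *_ = np.linalg.lstsq(X, y_true, rcond=None)  # solve X w ≈ y_true directly
w_pinv = np.linalg.pinv(X) @ y_true                   # pseudoinverse route

print(np.allclose(w_normal, w_lstsq), np.allclose(w_normal, w_pinv))  # True True
```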


We just located the weights vector ω that gives the best fit between our training data and the linear regression training function:

f(ω; x) = ω_0 + ω_1 x_1 + ω_2 x_2 + ω_3 x_3 + ω_4 x_4 + ω_5 x_5


We used an analytical method (compute the gradient of the loss function and set it equal to zero) to derive the solution given by the normal equation. This is one of the very rare instances where we are able to derive an analytical solution. All other methods for finding the minimizing ω will be numerical.


We Never Want to Fit the Training Data Too Well

The ω = (X^t X)^(-1) X^t y_true that we calculated gives the ω values that make the training function best fit the training data, but too good of a fit means that the training function might also be picking up on the noise and not only on the signal in the data. So the solution just mentioned, or even the minimization problem itself, needs to be modified in order to not get too good of a fit. Regularization or early stopping are helpful here. We will spend some time on those in Chapter 4.


This was the long way to regression. We had to pass through calculus and linear algebra on the way, because we are just starting. Presenting the upcoming machine learning models—logistic regression, support vector machines, decision trees, and random forests—will be faster, since all we do is apply the exact same ideas to different functions.


Logistic Regression: Classify into Two Classes


Logistic regression is mainly used for classification tasks. We first explain how we can use this model for binary classification tasks (classify into two classes, such as cancer/not cancer, safe for children/not safe, likely to pay back a loan/unlikely, etc.). Then we will generalize the model into classifying into multiple classes (for example, classify handwritten images of digits into 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9). Again, we have the same mathematical setup:

  1. Training function

  2. Loss function

  3. Optimization


Training Function


Similar to linear regression, the training function for logistic regression computes a linear combination of the features and adds a constant bias term, but instead of outputting the result as is, it passes it through the logistic function, whose graph is plotted in Figure 3-14, and whose formula is:

σ(s) = 1/(1 + e^(-s))


This is a function that only takes values between 0 and 1, so its output can be interpreted as a probability of a data point belonging to a certain class. If the output is less than 0.5, then classify the data point as belonging to the first class, and if the output is greater than 0.5, then classify the data point in the other class. The number 0.5 is the threshold where the decision to classify the data point is made.
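A minimal sketch of the logistic function and the 0.5 thresholding rule (the helper names are made up for illustration):

```python
# The logistic (sigmoid) function maps any score s to a probability in (0, 1);
# thresholding that probability at 0.5 turns it into a class decision.
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def classify(s, threshold=0.5):
    return 1 if logistic(s) > threshold else 0

print(logistic(0.0))                  # 0.5: exactly on the decision boundary
print(classify(3.2), classify(-3.2))  # prints 1 0
```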

Figure 3-14. Graph of the logistic function σ(s) = 1/(1 + e^(-s)). Note that the function can be evaluated at any s and always outputs a number between 0 and 1, so its output can be interpreted as a probability.


Therefore, the training function here ends up being a linear combination of features, plus bias, composed first with the logistic function, then finally composed with a thresholding function:

y = Thresh(σ(ω_0 + ω_1 x_1 + … + ω_n x_n))


Similar to the linear regression case, the ω ’s are the unknowns for which we need to optimize our loss function. Just like linear regression, the number of these unknowns is equal to the number of data features, plus one for the bias term. For tasks like classifying images, each pixel is a feature, so we could have thousands of those.


Loss Function


Let’s design a good loss function for classification. We are the engineers and we want to penalize wrongly classified training data points. In our labeled data set, if an instance belongs in a class, then its y true = 1 , and if it doesn’t, then its y true = 0 .


We want our training function to output y predict = 1 for training instances that belong in the positive class (whose y true is also 1). Successful ω values should give a high value of t (result of the linear combination step) to go into the logistic function, hence assigning high probability for positive instances and passing the 0.5 threshold to obtain y predict = 1 . Therefore, if the linear combination plus bias step gives a low t value while y true = 1 , penalize it.


Similarly, successful weight values should give a low t value to go into the logistic function for training instances that do not belong in the class (their true y true = 0 ). Therefore, if the linear combination plus bias step gives a high t value while y true = 0 , penalize it.


So how do we find a loss function that penalizes a wrongly classified training data point? Both false positives and false negatives should be penalized. Recall that the outputs of this classification model are either 1 or 0:


  • Think of a calculus function that rewards 1 and penalizes 0: - log ( s ) (see Figure 3-15).


  • Think of a calculus function that penalizes 1 and rewards 0: - log ( 1 - s ) (see Figure 3-15).

Figure 3-15. Left: graph of the function f(s) = -log(s). This function assigns high values to numbers close to 0 and low values to numbers close to 1. Right: graph of the function f(s) = -log(1 - s). This function assigns high values to numbers close to 1 and low values to numbers close to 0.


Now focus on the output of the logistic function σ ( s ) for the current choice of ω ’s:

  • If σ(s) is less than 0.5 (model prediction is y_predict = 0) but the true y_true = 1 (a false negative), make the model pay by penalizing -log(σ(s)). If instead σ(s) > 0.5, the model prediction is y_predict = 1 (a true positive), and -log(σ(s)) is small, so no high penalty is paid.

  • Similarly, if σ(s) is more than 0.5 but the true y_true = 0 (a false positive), make the model pay by penalizing -log(1 - σ(s)). Again, no high penalty is paid for a true negative.

Therefore, we can write the cost for misclassifying one training instance (x_1^i, x_2^i, …, x_n^i; y_true) as:

cost = -log(σ(s)) if y_true = 1, and cost = -log(1 - σ(s)) if y_true = 0; equivalently, cost = -y_true log(σ(s)) - (1 - y_true) log(1 - σ(s))


Finally, the loss function is the average cost over m training instances, giving us the formula for the popular cross-entropy loss function:

L(ω) = -(1/m) Σ_{i=1}^{m} [ y_true^i log(σ(ω_0 + ω_1 x_1^i + … + ω_n x_n^i)) + (1 - y_true^i) log(1 - σ(ω_0 + ω_1 x_1^i + … + ω_n x_n^i)) ]
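The cross-entropy loss can be evaluated directly. The sketch below averages -[y log(p) + (1 - y) log(1 - p)] over a toy batch, where the scores are hypothetical outputs of the linear-combination-plus-bias step:

```python
# Binary cross-entropy over a toy batch of 4 training instances.
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy_loss(y_true, scores):
    total = 0.0
    for y, s in zip(y_true, scores):
        p = logistic(s)  # predicted probability of the positive class
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
scores = [2.0, -1.5, 0.3, -0.2]  # hypothetical linear-combination-plus-bias outputs
print(cross_entropy_loss(y_true, scores))
```

Confident, correct predictions drive each term, and hence the average, toward zero; confident wrong ones blow it up, which is exactly the penalty structure designed above.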


Optimization

Unlike the linear regression case, if we decide to minimize the loss function by setting ∇L(ω) = 0, there is no closed-form solution formula for the ω's. The good news is that this function is convex, so gradient descent (or stochastic or mini-batch gradient descent), covered in Chapter 4, is guaranteed to find a minimum (if the learning rate is not too large and if we wait long enough).
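Since there is no closed-form solution, here is a minimal gradient descent sketch for logistic regression on a tiny, made-up one-feature data set. It uses the standard fact that the per-instance gradient of the cross-entropy loss is (σ(ω_0 + ω_1 x) - y) for the bias and (σ(ω_0 + ω_1 x) - y)x for the weight:

```python
# Gradient descent on the cross-entropy loss for a one-feature logistic regression.
import math

def sigma(s):
    return 1.0 / (1.0 + math.exp(-s))

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label) pairs
w0, w1 = 0.0, 0.0                                  # bias and weight, initialized at 0
learning_rate = 0.5

for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in data:
        error = sigma(w0 + w1 * x) - y  # per-instance gradient factor
        g0 += error
        g1 += error * x
    w0 -= learning_rate * g0 / len(data)
    w1 -= learning_rate * g1 / len(data)

predictions = [1 if sigma(w0 + w1 * x) > 0.5 else 0 for x, _ in data]
print(predictions)  # prints [0, 0, 1, 1]: all four points classified correctly
```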


Softmax Regression: Classify into Multiple Classes


We can easily generalize the logistic regression idea to classify into multiple classes. A famous example for such a nonbinary classification task is classifying images of the 10 handwritten numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 using the MNIST data set. This data set contains 70,000 images of handwritten numerals (see Figure 3-16 for samples of these images), split into a 60,000-image training subset and a 10,000-image test subset. Each image is labeled with the class it belongs to, which is one of the 10 numerals.


This data set also contains results from many classifying models, including linear classifiers, k-nearest neighbors, decision trees, support vector machines with various kernels, and neural networks with various architectures, along with references for the corresponding papers and their years of publication. It is interesting to see the progress in performance as the years go by and the methods evolve.

Figure 3-16. Sample images from the MNIST data set (image source)


Do Not Confuse Classifying into Multiple Classes with Multioutput Models


Softmax regression predicts one class at a time, so we cannot use it to classify, for example, five people in the same image. Instead, we can use it to check whether a given Facebook image is a picture of me, my sister, my brother, my husband, or my daughter. An image passed into the softmax regression model can have only one of the five of us, or the model’s classification would be less obvious. This means that our classes have to be mutually exclusive. So when Facebook automatically tags five people in the same image, they are using a multioutput model, not a softmax regression model.


Suppose we have the features of a data point and we want to use this information to classify the data point into one of k possible classes. The following training function, loss function, and optimization process should be clear by now.


Features of Image Data


For grayscale images, each pixel intensity is a feature, so images usually have thousands of features. Grayscale images are usually represented as two-dimensional matrices of numbers, with pixel intensities as the matrix entries. Color images come in three channels, red, green, and blue, where each channel is again represented as a two-dimensional matrix of numbers, and the channels are stacked on top of each other, forming three layers of two-dimensional matrices. This structure is called a tensor. Check out the notebook on processing images at this book’s GitHub page that illustrates how we can work with grayscale and color images in Python.


Training Function


The first step is always the same: linearly combine the features and add a constant bias term. In logistic regression, when we only had two classes, we passed the result into the logistic function of the formula:

σ(s) = 1/(1 + e^(-s)) = 1/(1 + 1/e^s) = e^s/(1 + e^s) = e^s/(e^0 + e^s)

which we interpreted as the probability of the data point belonging in the class of interest or not. Note that we rewrote the formula for the logistic function as σ(s) = e^s/(e^0 + e^s) to highlight the fact that it captures two probabilities, one for each class. In other words, σ(s) gives the probability that a data point is in the class of interest, and 1 - σ(s) = e^0/(e^0 + e^s) gives the probability that the data point is not in the class.

When we have multiple classes instead of only two, then for the same data point, we repeat the same process multiple times: once for each class. Each class has its own bias and its own set of weights that linearly combine the features. Thus, given a data point with feature values x_1, x_2, …, and x_n, we compute k different linear combinations plus biases:

s_1 = ω_0^1 + ω_1^1 x_1 + ω_2^1 x_2 + … + ω_n^1 x_n
s_2 = ω_0^2 + ω_1^2 x_1 + ω_2^2 x_2 + … + ω_n^2 x_n
⋮
s_k = ω_0^k + ω_1^k x_1 + ω_2^k x_2 + … + ω_n^k x_n


Get into Good Habits


You want to get into the good habit of keeping track of how many unknown ω ’s end up in the formula for your training function. Recall that these are the ω ’s that we find via minimizing a loss function. The other good habit is having an efficient and consistent way to organize them throughout your model (in a vector, matrix, etc.). In the softmax case, when we have k classes, and n features for each data point, we end up with k × n ω ’s for the linear combinations, then k biases, for a total of k × n + k unknown ω ’s. For example, if we use a softmax regression model to classify images in the MNIST data set of handwritten numerals, each image has 28 × 28 pixels, meaning 784 features, and we want to classify them into 10 classes, so we end up having to optimize for 7850 ω ’s. For both the linear and logistic regression models, we only had n + 1 unknown ω ’s that we needed to optimize for.
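The bookkeeping above can be captured in a one-line helper (a sketch; the function name is made up):

```python
# A softmax layer with n features and k classes has k*n linear-combination
# weights plus k biases.

def softmax_parameter_count(n_features, n_classes):
    return n_classes * n_features + n_classes

print(softmax_parameter_count(28 * 28, 10))  # prints 7850, as in the MNIST example
print(softmax_parameter_count(5, 1))         # prints 6: the n + 1 of linear/logistic regression
```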


Next we pass each of these k results into the softmax function, which generalizes the logistic function from two to multiple classes, and we also interpret it as a probability. The formula for the softmax function looks like:

σ(s_j) = e^(s_j) / (e^(s_1) + e^(s_2) + … + e^(s_k))


This way, the same data point will get k probability scores, one score corresponding to each class. Finally, we classify the data point as belonging to the class where it obtained the largest probability score.
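A minimal softmax sketch (the three scores below are hypothetical values of s_1, s_2, s_3 for one data point):

```python
# Softmax turns k scores into k probabilities that sum to 1; the prediction is
# the class with the largest probability.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.5, 0.3]
probabilities = softmax(scores)
predicted_class = max(range(len(probabilities)), key=lambda j: probabilities[j])

print(round(sum(probabilities), 10), predicted_class)  # prints 1.0 1
```

In practice, implementations subtract max(scores) before exponentiating; this leaves the probabilities unchanged but avoids overflow for large scores.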


Aggregating all of the above, we obtain the final formula of the training function that we can now use for classification (that is, after we find the optimal ω values by minimizing an appropriate loss function):

y = the j for which σ(ω_0^j + ω_1^j x_1 + … + ω_n^j x_n) is largest


Note that for this training function, all we have to do is input the data features (the x values), and it returns one class number: j.


The Logistic and Softmax Functions and Statistical Mechanics


If you are familiar with statistical mechanics, you might have noticed that the logistic and softmax functions calculate probabilities in the same way the partition function from the field of statistical mechanics calculates the probability of finding a system in a certain state.


Loss Function


We derived the cross-entropy loss function in the case of logistic regression:

L(ω) = -(1/m) Σ_{i=1}^{m} [ y_true^i log(σ(ω_0 + ω_1 x_1^i + … + ω_n x_n^i)) + (1 - y_true^i) log(1 - σ(ω_0 + ω_1 x_1^i + … + ω_n x_n^i)) ]


using:

cost = -log(σ(s)) if y_true = 1, and cost = -log(1 - σ(s)) if y_true = 0; equivalently, cost = -y_true log(σ(s)) - (1 - y_true) log(1 - σ(s))


Now we generalize the same logic to multiple classes. Let’s use the notation that y true,i = 1 if a certain data point belongs in the ith class, and is zero otherwise. Then we have the cost associated with misclassifying a certain data point as:

cost = -log(σ(s_1)) if y_true,1 = 1; -log(σ(s_2)) if y_true,2 = 1; -log(σ(s_3)) if y_true,3 = 1; …; -log(σ(s_k)) if y_true,k = 1. Equivalently, cost = -y_true,1 log(σ(s_1)) - … - y_true,k log(σ(s_k))


Averaging over all the m data points in the training set, we obtain the generalized cross-entropy loss function, generalizing it from the case of only two classes to the case of multiple classes:

L(ω) = -(1/m) Σ_{i=1}^{m} [ y_true,1^i log(σ(ω_0^1 + ω_1^1 x_1^i + … + ω_n^1 x_n^i)) + y_true,2^i log(σ(ω_0^2 + ω_1^2 x_1^i + … + ω_n^2 x_n^i)) + … + y_true,k^i log(σ(ω_0^k + ω_1^k x_1^i + … + ω_n^k x_n^i)) ]
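Because y_true,i is 1 only for the true class, the cost for one data point reduces to -log of the softmax probability assigned to that class. The sketch below (toy scores, hypothetical data) averages this over a small batch:

```python
# Generalized (categorical) cross-entropy: average of -log(softmax probability
# of the true class) over the training points.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generalized_cross_entropy(batch_scores, true_classes):
    total = 0.0
    for scores, true_class in zip(batch_scores, true_classes):
        probabilities = softmax(scores)
        total += -math.log(probabilities[true_class])
    return total / len(true_classes)

batch_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # k = 3 classes, m = 2 points
true_classes = [0, 2]
print(generalized_cross_entropy(batch_scores, true_classes))
```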


Optimization


Now that we have a formula for the loss function, we can search for its minimizing ω ’s. As with most of the loss functions that we will encounter, there is no explicit formula for the minimizers of this loss function in terms of the training set and their target labels, so we settle for finding the minimizers using numerical methods, in particular: gradient descent, stochastic gradient descent, or mini-batch gradient descent (see Chapter 4). Again, the generalized cross-entropy loss function has its convexity working to our advantage in the minimization process, so we are guaranteed to find our sought-after ω ’s.


Cross Entropy and Information Theory


The cross-entropy concept is borrowed from information theory. We will elaborate on this when discussing decision trees later in this chapter. For now, keep the following quantity in mind, where p is the probability of an event occurring:

log(1/p) = -log(p)

That quantity is large when p is small; therefore, it quantifies a bigger surprise for less probable events.


Incorporating These Models into the Last Layer of a Neural Network


The linear regression model makes its predictions by appropriately linearly combining data features, then adding bias. The logistic regression and the softmax regression models make their classifications by appropriately linearly combining data features, adding bias, then passing the result into a probability scoring function. In these simple models, the features of the data are only linearly combined, hence, these models are weak in terms of picking up on potentially important nonlinear interactions among the data features. Neural network models incorporate nonlinear activation functions into their training functions, do this over multiple layers, and hence are better equipped to detect nonlinear and more complex relationships. The last layer of a neural network is its output layer. The layer right before the last layer spits out some higher-order features and inputs them into the last layer. If we want our network to classify data into multiple classes, then we can make our last layer a softmax layer; if we want it to classify into two classes, then our last layer can be a logistic regression layer; and if we want the network to predict numerical values, then we can make its last layer a regression layer. We will see examples of these in Chapter 5.

Other Popular Machine Learning Techniques and Ensembles of Techniques

After regression and logistic regression, it is important to branch out into the machine learning community and learn the ideas behind some of the most popular techniques for classification and regression tasks. Support vector machines, decision trees, and random forests are very powerful and popular, and are able to perform both classification and regression tasks. The natural question is then, when do we use a specific machine learning method, including linear and logistic regression, and later neural networks? How do we know which method to use and base our conclusions and predictions on? These are the types of questions where the mathematical analysis of the machine learning models helps.

Since the mathematical analysis of each method, including the types of data sets it is usually best suited for, is only now gaining serious attention after the recent increase in resource allocation for research in AI, machine learning, and data science, the current practice is to try out each method on the same data set and use the one with the best results. That is, assuming we have the required computational and time resources to try out different machine learning techniques. Even better, if you do have the time and resources to train various machine learning models (parallel computing is perfect here), then it is good to utilize ensemble methods. These combine the results of different machine learning models, either by averaging or by voting, and, ironically yet in a mathematically sound way, they give better results than the best individual performers, even when the individual performers are weak!

One example of an ensemble is a random forest: it is an ensemble of decision trees.

When basing our predictions on ensembles, industry terms like bagging (or bootstrap aggregating), pasting, boosting (such as AdaBoost and gradient boosting), stacking, and random patches appear. Bagging and pasting train the same machine learning model on different random subsets of the training set. Bagging samples instances from the training set with replacement, and pasting samples instances from the training set without replacement. Random patches sample from the feature space as well, training a machine learning model on a random subset of the features at a time. This is very helpful when the data set has many, many features, such as images (where each pixel is a feature). Stacking learns the prediction mechanism of the ensemble instead of using simple voting or averaging.
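
The sampling schemes behind bagging, pasting, and random patches can be sketched with NumPy's random sampling; the sizes below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n_features = 100, 20

# Bagging: sample training instances WITH replacement (bootstrap).
bag_idx = rng.choice(m, size=m, replace=True)

# Pasting: sample training instances WITHOUT replacement.
paste_idx = rng.choice(m, size=60, replace=False)

# Random patches: also sample a random subset of the features.
patch_features = rng.choice(n_features, size=5, replace=False)

print(len(np.unique(bag_idx)))    # typically around 63 of 100: duplicates appear
print(len(np.unique(paste_idx)))  # exactly 60: no duplicates
```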

Support Vector Machines

The support vector machine is an extremely popular machine learning method that's able to perform classification and regression tasks with both linear (flat) and nonlinear (curved) decision boundaries.

For classification, this method seeks to separate the labeled data using the widest possible margin, resulting in an optimal highway of separation as opposed to a thin line of separation. Let’s explain how support vector machines classify labeled data instances in the context of this chapter’s structure of training function, loss function, and optimization.

Training function

Once again we linearly combine the features of a data point with unknown weights ω ’s and add bias ω 0 . We then pass the answer through the sign function. If the linear combination of features plus bias is a positive number, return 1 (or classify in the first class), and if it is negative, return –1 (or classify in the other). So the formula for the training function becomes:

f ( ω ; x ) = sign ( ω t x + ω 0 )

Loss function

We must design a loss function that penalizes misclassified points. For logistic regression, we used the cross-entropy loss function. For support vector machines, our loss function is based on a function called the hinge loss function:

max ( 0 , 1 - y true ( ω t x + ω 0 ) )

Let’s see how the hinge loss function penalizes errors in classification. First, recall that y true is either 1 or –1, depending on whether the data point belongs in the positive or the negative class.

  • If for a certain data point y true is 1 but ω t x + ω 0 < 0 , the training function will misclassify it and give us y predict = - 1 , and the hinge loss function’s value will be 1 - ( 1 ) ( ω t x + ω 0 ) > 1 , which is a high penalty when your goal is to minimize.

  • If, on the other hand, y true is 1 and ω t x + ω 0 > 0 , the training function will correctly classify it and give us y predict = 1 . The hinge loss function, however, is designed in such a way that it would still penalize us if ω t x + ω 0 < 1 , and its value will be 1 - ( 1 ) ( ω t x + ω 0 ) , which is now less than 1 but still bigger than 0.

  • Only when y true is 1 and ω t x + ω 0 > 1 (the training function will still correctly classify this point and give y predict = 1 ) will the hinge loss function value be 0, since it will be the maximum between 0 and a negative quantity.

  • The same logic applies when y true is –1. The hinge loss function will penalize a lot for a wrong prediction, and a little for a right prediction when it doesn’t have far enough margin from the zero divider (a margin bigger than 1). The hinge loss function will return 0 only when the prediction is right and the point is at a distance larger than 1 from the 0 divider.

  • Note that the 0 divider has the equation ω t x + ω 0 = 0 , and the margin edges have equations ω t x + ω 0 = - 1 and ω t x + ω 0 = 1 . The distance between the margin edges is easy to calculate as 2/‖ω‖₂. So if we want to increase this margin width, we have to decrease ‖ω‖₂; thus, this term must enter the loss function, along with the hinge loss function, which penalizes both misclassified points and points within the margin boundaries.
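
The case analysis above can be checked directly in code; hinge_loss below is a straight transcription of max(0, 1 - y_true(ω t x + ω 0)), with score standing in for ω t x + ω 0:

```python
def hinge_loss(y_true: int, score: float) -> float:
    """Hinge loss for a point with label y_true in {-1, +1},
    where score = w^t x + w_0."""
    return max(0.0, 1.0 - y_true * score)

# Misclassified point (y_true = 1 but score < 0): penalty above 1.
print(hinge_loss(1, -0.5))   # 1.5
# Correct but inside the margin (0 < score < 1): small positive penalty.
print(hinge_loss(1, 0.4))    # 0.6
# Correct and outside the margin (score > 1): no penalty.
print(hinge_loss(1, 2.0))    # 0.0
```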

Now if we average the hinge loss over all the m data points in the training set, and add the term λ ‖ω‖₂², we obtain the formula for the loss function that is commonly used for support vector machines:

L ( ω ) = 1/m ∑ i=1 m max ( 0 , 1 - y true i ( ω t x i + ω 0 ) ) + λ ‖ω‖₂²

Optimization

Our goal now is to search for the ω that minimizes the loss function. Let’s observe this loss function for a minute:

  • It has two terms: 1/m ∑ i=1 m max ( 0 , 1 - y true i ( ω t x i + ω 0 ) ) and λ ‖ω‖₂². Whenever we have more than one term in an optimization problem, they are most likely competing terms, in the sense that the same ω values that make the first term small (and thus happy) might make the second term big (and thus sad). So it is a push-and-pull game between the two terms as we search for the ω that optimizes their sum.

  • The λ that appears with the λ ‖ω‖₂² term is an example of a model hyperparameter that we can tune during the validation stage of the training process. Note that controlling the value of λ helps us control the width of the margin in this way: if we choose a large λ value, the optimizer will get busy choosing ω with a very low ‖ω‖₂², to compensate for that large λ, and the first term of the loss function will get less attention. But recall that a smaller ‖ω‖₂ means a larger margin!

  • The λ ‖ω‖₂² term can also be thought of as a regularization term, which we will discuss in Chapter 4.

  • This loss function is convex and bounded below by 0, so its minimization problem is not too bad: we don’t have to worry about getting stuck at local minima. The first term has a singularity, but as mentioned before, we can define its subgradient at the singular point, then apply a descent method.
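
As a minimal sketch of this minimization, here is subgradient descent on the SVM loss for a synthetic, linearly separable toy set; the data, learning rate, and λ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linearly separable data: two blobs with labels -1 and +1.
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

lam, lr = 0.01, 0.1
w, w0 = np.zeros(2), 0.0
for _ in range(200):
    scores = X @ w + w0
    # Subgradient of the hinge term: nonzero only where the margin is violated.
    violated = y * scores < 1
    grad_w = -(y[violated, None] * X[violated]).sum(axis=0) / len(X) + 2 * lam * w
    grad_w0 = -y[violated].sum() / len(X)
    w, w0 = w - lr * grad_w, w0 - lr * grad_w0

accuracy = np.mean(np.sign(X @ w + w0) == y)
print(accuracy)  # should approach 1.0 on this easily separable toy set
```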

Some optimization problems can be reformulated, and instead of solving the original primal problem, we end up solving its dual problem! Usually, one is easier to solve than the other. We can think of the dual problem as another optimization problem living in a parallel universe of the primal problem. The universes meet at the optimizer. Hence solving one problem automatically gives the solution of the other. We study duality when we study optimization. Of particular interest and huge application are linear and quadratic optimization, also known as linear and quadratic programming. The minimization problem that we currently have:

min ω 1/m ∑ i=1 m max ( 0 , 1 - y true i ( ω t x i + ω 0 ) ) + λ ‖ω‖₂²

is an example of quadratic programming, and it has a dual problem formulation that turns out to be easier to optimize than the primal (especially when the number of features is high):

max α ∑ j=1 m α j - 1/2 ∑ j=1 m ∑ k=1 m α j α k y true j y true k ( x j ) t x k

subject to the constraints α j 0 and j=1 m α j y true j = 0 . Writing that formula is usually straightforward when we learn about primal and dual problems, so we skip the derivation in favor of not interrupting our flow.

Quadratic programming is a very well-developed field, and there are many software packages that can solve this problem. Once we find the maximizing α , we can find the vector ω that minimizes the primal problem using ω = ∑ j=1 m α j y true j x j . Once we have our ω , we can classify new data points using our now trained function:

f ( x new ) = sign ( ω t x new + ω 0 ) = sign ( ∑ j α j y true j ( x j ) t x new + ω 0 )

If you want to avoid quadratic programming, there is another very fast method called coordinate descent that solves the dual problem and works very well with large data sets with a high number of features.

The kernel trick

We can now transition the same ideas to nonlinear classification. Let’s first observe this important note about the dual problem: the data points appear only in pairs, more specifically, only in a scalar product, namely, (x j ) t x k . Similarly, they only appear as a scalar product in the trained function. This simple observation allows for magic:

  • If we find a function K ( x j , x k ) that can be applied to pairs of data points, and it happens to give us the scalar product of pairs of transformed data points in some higher-dimensional space (without knowing what the actual transformation is), then we can solve the same exact dual problem in the higher-dimensional space by replacing the scalar product in the formula of the dual problem with K ( x j , x k ) .

  • The intuition here is that data that is nonlinearly separable in lower dimensions is almost always linearly separable in higher dimensions. So transform all the data points to higher dimensions, then separate. The kernel trick solves the linear classification problem in higher dimensions without transforming each point. The kernel itself evaluates the dot product of transformed data without transforming the data. Pretty cool stuff.

Examples of kernel functions include:

  • K ( x j , x k ) = ( ( x j ) t x k ) 2

  • The polynomial kernel: K ( x j , x k ) = ( 1 + ( x j ) t x k ) d

  • The Gaussian kernel: K ( x j , x k ) = e -γ‖x j - x k ‖ 2
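
The last two kernels can be transcribed directly into code; the test values below are arbitrary:

```python
import numpy as np

def polynomial_kernel(xj, xk, d=2):
    """K(x^j, x^k) = (1 + (x^j)^t x^k)^d."""
    return (1.0 + xj @ xk) ** d

def gaussian_kernel(xj, xk, gamma=1.0):
    """K(x^j, x^k) = exp(-gamma * ||x^j - x^k||^2)."""
    diff = xj - xk
    return np.exp(-gamma * (diff @ diff))

xj, xk = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(xj, xk))  # (1 + 0.5 - 2)^2 = 0.25
print(gaussian_kernel(xj, xj))    # 1.0: a point is at distance 0 from itself
```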

Decision Trees

Staying with our driving theme for this chapter that everything is a function, a decision tree, in essence, is a function that takes Boolean variables as input (these are variables that can only assume true [or 1] or false [or 0] values), such as: is the feature > 5, is the feature = sunny, is the feature = man, etc. It outputs a decision, such as: approve the loan, classify as COVID-19, return 25, etc. Instead of adding or multiplying Boolean variables, we use the logical or, and, and not operators.

But what if our features are not given in the original data set as Boolean variables? Then we must transform them to Boolean variables before feeding them into the model to make predictions. For example, the decision tree in Figure 3-17 was trained on the Fish Market data set. It is a regression tree. The tree takes raw data, but the function representing the tree actually operates on new variables, which are the original data features transformed into Boolean variables:

  1. a1 = (Width ≤ 5.117)

  2. a2 = (Length3 ≤ 59.55)

  3. a3 = (Length3 ≤ 41.1)

  4. a4 = (Length3 ≤ 34.9)

  5. a5 = (Length3 ≤ 27.95)

  6. a6 = (Length3 ≤ 21.25)

Figure 3-17. A regression decision tree grown on the Fish Market data set. See the accompanying Jupyter notebook for details.

Now the function representing the decision tree in Figure 3-17 is:

f ( a1 , a2 , a3 , a4 , a5 , a6 ) = ( a1 ∧ a5 ∧ a6 ) × 39.584 + ( a1 ∧ a5 ∧ ¬a6 ) × 139.968 + ( a1 ∧ ¬a5 ∧ a4 ) × 287.278 + ( a1 ∧ ¬a5 ∧ ¬a4 ) × 422.769 + ( ¬a1 ∧ a2 ∧ a3 ) × 639.737 + ( ¬a1 ∧ a2 ∧ ¬a3 ) × 824.211 + ( ¬a1 ∧ ¬a2 ) × 1600

Note that unlike the training functions that we’ve encountered in this chapter so far, this function has no parameters ω ’s that we need to solve for. This is called a nonparametric model, and it doesn’t fix the shape of the function ahead of time. This gives it the flexibility to grow with the data, or in other words, adapt to the data. Of course, with this high adaptability to the data comes the high risk of overfitting the data. Thankfully there are ways around this, some of which we list here without any elaboration: pruning the tree after growing it, restricting the number of layers, setting a minimum number of data instances per node, or using an ensemble of trees instead of one tree, called a random forest, discussed later.
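
In scikit-learn, for example, several of these overfitting controls are exposed as hyperparameters; this sketch uses synthetic data rather than the Fish Market set:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# An unconstrained tree can grow until it essentially memorizes the training data...
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
# ...while limiting the depth and the leaf size regularizes it.
shallow = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X, y)

print(deep.get_depth(), shallow.get_depth())
```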

One very important observation: the decision tree decided to split over only two features of the original data set, namely the Width and Length3 features. Decision trees are designed to keep the more important features (those providing the most information that contribute to our prediction) closer to the root. Therefore, decision trees can help in feature selection, where we select the most important features to contribute to our final model’s predictions.

It is no wonder that the Width and Length3 features ended up being the most important for predicting the weight of the fish. The correlation matrix in Figure 3-18 and the scatterplots in Figure 3-3 show extremely strong correlation between all the length features. This means that the information they provide is redundant, and including all of them in our prediction models will increase computation costs and lower performance.

Figure 3-18. The correlation matrix of the Fish Market data set. There is extremely strong correlation among all the length features.

Feature Selection

We just introduced the very important topic of feature selection. Real-world data sets come with many features, and some of them may provide redundant information, while others are not important at all for predicting our target label. Including irrelevant and redundant features in a machine learning model increases computational cost and lowers its performance. We just saw that decision trees are one way to help select the important features. Another way is a regularization technique called lasso regression, which we will introduce in Chapter 4. There are also statistical tests that measure how much the target depends on each feature: the F-test tests for linear dependencies (it gives higher scores to correlated features, but correlations alone can be deceptive), and mutual information tests for nonlinear dependencies. These provide a measure of how much a feature contributes to determining the target label, and hence aid in feature selection by keeping the most promising features. We can also test for dependencies among the features themselves, along with their correlations and scatterplots. Variance thresholding removes features with little to no variance, on the premise that if a feature does not vary much within itself, it has little predictive power.
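
A sketch of these tests using scikit-learn's feature_selection module on synthetic data; the three features below are deliberately constructed to be informative, redundant, and constant, respectively:

```python
import numpy as np
from sklearn.feature_selection import (f_regression, mutual_info_regression,
                                       VarianceThreshold)

rng = np.random.default_rng(0)
m = 300
informative = rng.normal(size=m)                          # linearly related to the target
redundant = informative + rng.normal(scale=0.01, size=m)  # nearly a copy of it
constant = np.full(m, 3.0)                                # no variance at all
X = np.column_stack([informative, redundant, constant])
y = 2.0 * informative + rng.normal(scale=0.1, size=m)

# F-test: scores the linear dependence of the target on each feature.
f_scores, _ = f_regression(X, y)
# Mutual information: also captures nonlinear dependence.
mi_scores = mutual_info_regression(X, y, random_state=0)
# Variance thresholding: drops features that barely vary.
kept = VarianceThreshold(threshold=1e-6).fit(X).get_support()
print(kept)  # the constant column is removed
```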

How do we train a decision tree on a data set? What function do we optimize? There are two functions that are usually optimized when growing decision trees: the entropy and the Gini impurity. Using one or the other doesn’t make much difference in the resultant trees. We develop these next.

Entropy and Gini impurity

Here, we decide to split a node of the tree on the feature that is evaluated as the most important. Entropy and Gini impurity are two popular ways to measure importance of a feature. They are not mathematically equivalent, but they both work and provide reasonable decision trees. Gini impurity is usually less expensive to compute, so it is the default in software packages, but you have the option to change from the default setting to entropy. Using Gini impurity tends to produce less-balanced trees when there are classes with much higher frequency than others. These classes end up isolated in their own branches. However, in many cases, using either entropy or Gini impurity does not provide much difference in the resulting decision trees.

With the entropy approach, we look for the feature split that provides the maximal information gain (we’ll give its formula shortly). Information gain is borrowed from information theory, and it has to do with the concept of entropy. Entropy, in turn, is borrowed from thermodynamics and statistical physics, and it quantifies the amount of disorder in a certain system.

With the Gini impurity approach, we look for the feature split that provides children nodes with lowest average Gini impurity (we’ll also give its formula shortly).

To maximize information gain (or minimize Gini impurity), the algorithm that grows a decision tree has to go over each feature of the training data subset and calculate the information gain (or Gini impurity) that would result if the tree used that particular feature as the node to split on, then choose the feature that provides the highest information gain (or the children nodes with the lowest average Gini impurity). Moreover, if the feature has real numerical values, the algorithm has to decide what question to ask at the node, meaning what feature value to split on; for example, is x 5 < 0.1 ? The algorithm has to do this sequentially at each layer of the tree, calculating information gains (or Gini impurities) over the features of the data instances in each node, and sometimes over each possible split value. This is easier to understand with examples. But first, we write the formulas for the entropy, information gain, and Gini impurity.

Entropy and information gain

The easiest way to understand the entropy formula is to rely on the intuition that if an event is highly probable, then there is little surprise associated with it happening. So when p(event) is large, its surprise is low. We can mathematically encode this with a function that decreases as the probability increases. The function log(1/x) works, and has the additional property that the surprises of independent events add up. Therefore, we can define:

Surprise ( event ) = log ( 1 / p(event) ) = - log p ( event )

Now the entropy of a random variable (which in our case is a particular feature in our training data set) is defined as the expected surprise associated with the random variable, so we must add up the surprises of each possible outcome of the random variable (surprise of each value of the feature in question) multiplied by their respective probabilities, obtaining:

Entropy ( X ) = - p ( outcome 1 ) log p ( outcome 1 ) - p ( outcome 2 ) log p ( outcome 2 ) - ⋯ - p ( outcome n ) log p ( outcome n )

The entropy for one feature of our training data that assumes a bunch of values is:

Entropy ( Feature ) = - p ( value 1 ) log p ( value 1 ) - p ( value 2 ) log p ( value 2 ) - ⋯ - p ( value n ) log p ( value n )

Since our goal is to select to split on a feature that provides large information gain about the outcome (the label or the target feature), let’s first calculate the entropy of the outcome feature.

Binary output

Assume for simplicity that this is a binary classification problem, so the outcome feature only has two values: positive (in the class) and negative (not in the class).

If we let p be the number of positive instances in the target feature and n be the number of negative ones, then p + n = m will be the number of instances in the training data subset. Now the probability of selecting a positive instance from that target column is p/m = p/(p+n), and the probability of selecting a negative instance is similarly n/m = n/(p+n).

Thus, the entropy of the outcome feature (without leveraging any information from the other features) is:

Entropy ( outcome feature ) = - p ( positive ) log p ( positive ) - p ( negative ) log p ( negative ) = - p/(p+n) log ( p/(p+n) ) - n/(p+n) log ( n/(p+n) )

Next, we leverage information from one other feature and calculate the difference in entropy of the outcome feature, which we expect to decrease as we gain more information (more information generally results in less surprise).

Suppose we choose Feature A to split a node of our decision tree on. Suppose Feature A assumes four values, and has k 1 instances with value 1 ; of these, p 1 are labeled positive as their target outcome, and n 1 are labeled negative as their target outcome, so p 1 + n 1 = k 1 . Similarly, Feature A has k 2 instances with value 2 ; of these, p 2 are labeled positive as their target outcome, and n 2 are labeled negative as their target outcome, so p 2 + n 2 = k 2 . The same applies for value 3 and value 4 of Feature A. Note that k 1 + k 2 + k 3 + k 4 = m , the total number of instances in the training subset of the data set.

We can think of each value k of Feature A as a random variable in its own right, with p k positive outcomes and n k negative outcomes, so we can calculate its entropy (expected surprise):

Entropy ( value 1 ) = - p 1 /(p 1 +n 1 ) log ( p 1 /(p 1 +n 1 ) ) - n 1 /(p 1 +n 1 ) log ( n 1 /(p 1 +n 1 ) )
Entropy ( value 2 ) = - p 2 /(p 2 +n 2 ) log ( p 2 /(p 2 +n 2 ) ) - n 2 /(p 2 +n 2 ) log ( n 2 /(p 2 +n 2 ) )
Entropy ( value 3 ) = - p 3 /(p 3 +n 3 ) log ( p 3 /(p 3 +n 3 ) ) - n 3 /(p 3 +n 3 ) log ( n 3 /(p 3 +n 3 ) )
Entropy ( value 4 ) = - p 4 /(p 4 +n 4 ) log ( p 4 /(p 4 +n 4 ) ) - n 4 /(p 4 +n 4 ) log ( n 4 /(p 4 +n 4 ) )

Now that we have this information, we can calculate the expected entropy after splitting on Feature A, by adding the four entropies just mentioned, each multiplied by its respective probability: p ( value 1 ) = k 1 /m , p ( value 2 ) = k 2 /m , p ( value 3 ) = k 3 /m , and p ( value 4 ) = k 4 /m .

Therefore, the expected entropy after splitting on Feature A would be:

ExpectedEntropy ( Feature A ) = p ( value 1 ) Entropy ( value 1 ) + p ( value 2 ) Entropy ( value 2 ) + p ( value 3 ) Entropy ( value 3 ) + p ( value 4 ) Entropy ( value 4 ) = (k 1 /m) Entropy ( value 1 ) + (k 2 /m) Entropy ( value 2 ) + (k 3 /m) Entropy ( value 3 ) + (k 4 /m) Entropy ( value 4 )

So what would be the information gained from using Feature A to split on? It would be the difference between the entropy of the outcome feature without any information from Feature A and the expected entropy of Feature A. That is, we have a formula for information gain given that we decide to split on Feature A:

Information gain = Entropy ( outcome feature ) - ExpectedEntropy ( Feature A ) = - p/(p+n) log ( p/(p+n) ) - n/(p+n) log ( n/(p+n) ) - ExpectedEntropy ( Feature A )

Now it is easy to go through each feature of the training data subset and calculate the information gain resulting from using that feature to split on. Ultimately, the decision tree algorithm decides to split on the feature with highest information gain. The algorithm does this recursively for each node and at each layer of the tree, until it runs out of features to split on or data instances. This is how we obtain our entropy-based decision tree.
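
The whole procedure condenses into a few lines of Python; the toy feature and labels below are hypothetical (the entropy here uses log base 2, so it is measured in bits):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum of p_i * log2(p_i)."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the outcome minus the expected entropy after
    splitting on the feature."""
    m = len(labels)
    expected = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        expected += (len(subset) / m) * entropy(subset)
    return entropy(labels) - expected

# Hypothetical toy data: the feature perfectly determines the label,
# so splitting on it recovers all the entropy of the outcome.
feature = ["a", "a", "b", "b"]
labels = ["+", "+", "-", "-"]
print(entropy(labels))                    # 1.0 bit
print(information_gain(feature, labels))  # 1.0: maximal gain
```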

Multi-class output

It is not too difficult to generalize this logic to the case where we have a multiclass output, for example, a classification problem with three or more target labels. The classical Iris data set from the UCI Machine Learning Repository is a great example with three target labels. This data set has four features for a given Iris flower: its sepal length and width, and its petal length and width. Note that each of these features is a continuous random variable, not discrete. So we have to devise a test to split on the values of each feature, before applying the previously mentioned logic. This is part of the feature engineering stage of a data science project. The engineering step here is: transform a continuous valued feature into a Boolean feature; for example, is the petal length > 2.45? We will not go over how to choose the number 2.45, but by now you probably can guess that there is an optimization process that should go on here as well.

Gini impurity

Each decision tree is characterized by its nodes, branches, and leaves. A node is considered pure if it only contains data instances from the training data subset that have the same target label (this means they belong to the same class). Note that a pure node is a desired node, since we know its class. Therefore, an algorithm would want to grow a tree in a way that minimizes the impurity of the nodes: if the data instances in a node do not all belong in the same class, then the node is impure. Gini impurity quantifies this impurity the following way.

Suppose that our classification problem has three classes, like the Iris data set. Suppose also that a certain node in a decision tree grown to fit this data set has n training instances, with n 1 of these belonging in the first class, n 2 in the second class, and n 3 in the third class (so n 1 + n 2 + n 3 = n ). Then the Gini impurity of this node is given by:

$\text{Gini impurity} = 1 - \left(\frac{n_1}{n}\right)^2 - \left(\frac{n_2}{n}\right)^2 - \left(\frac{n_3}{n}\right)^2$

So for each node, the fraction of the data instances belonging to each class is calculated, squared, then the sum of those is subtracted from 1. Note that if all the data instances of a node belong in the same class, then that formula gives a Gini impurity equal to 0.

The decision tree growing algorithm now looks for the feature and split point in each feature that produce children nodes with the lowest Gini impurity, on average. This means the children of a node must on average be purer than the parent node. Thus, the algorithm tries to minimize a weighted average of the Gini impurities of two of the children nodes (of a binary tree). Each child’s Gini impurity is weighted by its relative size, which is the ratio between its number of instances relative to the total number of instances in that tree layer (which is the same as the number of instances as its parent’s). Thus, we end up having to search for the feature and the split point (for each feature) combination that solve the following minimization problem:

$\min_{feature,\ featureSplitValue} \ \frac{n_{left}}{n} G_{left\,node} + \frac{n_{right}}{n} G_{right\,node}$

where $n_{left}$ and $n_{right}$ are the number of data instances that end up being in the left and right children nodes, and $n$ is the number of data instances that are in the parent node (note that $n_{left}$ and $n_{right}$ must add up to $n$).
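
A minimal sketch of this search on a single continuous feature, in pure Python with made-up values (these are illustrative numbers, not the actual Iris measurements, and not how scikit-learn implements CART):

```python
# Minimal sketch of a Gini-based split search on one continuous feature.
# The values and labels below are made up for illustration.

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try a threshold between each pair of consecutive sorted values and
    return the (weighted impurity, threshold) pair minimizing the weighted
    Gini impurity of the two children."""
    n = len(values)
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, n):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[0]:
            best = (weighted, threshold)
    return best

# Made-up petal lengths for two classes: 0 = setosa-like, 1 = versicolor-like
values = [1.4, 1.3, 1.5, 1.7, 4.5, 4.1, 4.7, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
impurity, threshold = best_split(values, labels)
print(impurity, threshold)  # a perfect split gives weighted Gini 0.0
```

The same exhaustive scan, repeated over every feature, is what makes growing a tree expensive; the threshold 2.45 mentioned earlier for the Iris petal length comes out of exactly this kind of optimization.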

Regression decision trees

It is important to point out that decision trees can be used for both regression and classification. A regression decision tree returns a predicted value rather than a class, but a similar process to a classification tree applies.

Instead of splitting a node by selecting a feature and a feature value (for example, is height > 3 feet?) that maximize information gain or minimize Gini impurity, we select a feature and a feature value that minimize a mean squared distance between the true labels and the average of the labels of all the instances in each of the left and right children nodes. That is, the algorithm chooses a feature and feature value to split on, then looks at the left and right children nodes resulting from that split, and calculates:

  • The average value of all the labels of the training data instances in the left node. This average will be the left node value $y_{left}$, and is the value predicted by the decision tree if this node ends up being a leaf node.

  • The average value of all the labels of the training data instances in the right node. This average will be the right node value $y_{right}$. Similarly, this is the value predicted by the decision tree if this node ends up being a leaf node.

  • The sum of the squared distances between the left node value and the true label of each instance in the left node: $\sum_{LeftNodeInstances} |y_{true_i} - y_{left}|^2$.

  • The sum of the squared distances between the right node value and the true label of each instance in the right node: $\sum_{RightNodeInstances} |y_{true_i} - y_{right}|^2$.

  • A weighted average of the just-mentioned two sums, where each node is weighted by its size relative to the parent node, just like we did for the Gini impurity:

$\frac{n_{left}}{n} \sum_{LeftNodeInstances} |y_{true_i} - y_{left}|^2 + \frac{n_{right}}{n} \sum_{RightNodeInstances} |y_{true_i} - y_{right}|^2$

That algorithm is greedy and computation-heavy, in the sense that it has to do this for each feature and each possible feature split value, then choose the feature and feature split that provide the smallest weighted squared error average between the left and right children nodes.
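
The scoring step above can be sketched in a few lines (a toy illustration with made-up feature values and labels, not the CART implementation):

```python
# Sketch of scoring one candidate split for a regression tree.
# Labels are continuous; a node predicts the mean of its labels.

def squared_error(labels):
    """Sum of squared distances from each label to the node's mean."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)

def split_score(values, labels, threshold):
    """Weighted average of the children's squared errors, as in the text."""
    n = len(values)
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    return (len(left) / n) * squared_error(left) + (len(right) / n) * squared_error(right)

# Hypothetical heights (the feature) and prices (the continuous labels)
heights = [1.0, 2.0, 2.5, 6.0, 7.0, 8.0]
prices = [10.0, 10.0, 10.0, 30.0, 30.0, 30.0]
print(split_score(heights, prices, 3.0))  # 0.0: each child is perfectly predicted by its mean
```

The greedy algorithm evaluates this score for every feature and every candidate threshold and keeps the combination with the smallest value.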

The famous CART (Classification and Regression Tree) algorithm is used by software packages, including Python’s scikit-learn, which we use in the Jupyter notebooks supplementing this book. This algorithm produces trees with nodes that only have two children (binary trees), where the test at each node only has Yes or No answers. Other algorithms such as ID3 can produce trees with nodes that have two or more children.

Shortcomings of decision trees

Decision trees are very easy to interpret and are popular for many good reasons: they adapt to large data sets and different data types (discrete and continuous features, no scaling of data needed), and can perform both regression and classification tasks. However, they can be unstable, in the sense that adding just one instance to the data set can change the tree at its root and hence result in a very different decision tree. They are also sensitive to rotations in the data, since their decision boundaries are usually horizontal and vertical (not slanted like support vector machines). This is because splits usually happen at specific feature values, so the decision boundaries end up parallel to the feature axes. One fix is to transform the data set to match its principal axes, using the singular value decomposition method presented in Chapter 6. Decision trees tend to overfit the data, so they need pruning. This is usually done using statistical tests. The greedy algorithms involved in constructing the trees, where the search happens over all features and their values, make them computationally expensive and less accurate. Random forests, discussed next, address some of these shortcomings.

Random Forests

When I first learned about decision trees, the most perplexing aspects for me were:

  • How do we start the tree, meaning how do we decide which data feature is the root feature?

  • At what particular feature value do we decide to split a node?

  • When do we stop?

  • In essence, how do we grow a tree?

(Note that we answered some of these questions in the previous subsection.) It didn’t make matters any easier that I would surf the internet looking for answers, only to encounter declarations that decision trees are so easy to build and understand, so it felt like I was the only one deeply confused by decision trees.

My puzzlement instantly disappeared when I learned about random forests. The amazing thing about random forests is that we can get incredibly good regression or classification results without answering any of my bewildering questions. Randomizing the whole process means building many decision trees while answering all my questions with two words: choose randomly. Then aggregating the predictions in an ensemble produces very good results, even better than one carefully crafted decision tree. It has been said that randomization often produces reliability!
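
As a toy illustration of "choose randomly, then aggregate": the stand-in "trees" below are decision stumps with randomly chosen thresholds on a single made-up feature. Real random forests also bootstrap the training data and randomize the choice of features, but the voting idea is the same:

```python
import random
from collections import Counter

# Toy illustration of randomize-then-aggregate. Each "tree" is a stump
# with a randomly chosen threshold; predictions are a majority vote.

random.seed(0)

def grow_random_stump(values, labels):
    """Pick a random split point; each side predicts its majority class."""
    threshold = random.uniform(min(values), max(values))
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    left_vote = Counter(left or labels).most_common(1)[0][0]
    right_vote = Counter(right or labels).most_common(1)[0][0]
    return lambda v: left_vote if v <= threshold else right_vote

def forest_predict(stumps, v):
    """Aggregate: majority vote over all the randomized stumps."""
    votes = Counter(stump(v) for stump in stumps)
    return votes.most_common(1)[0][0]

values = [1.3, 1.4, 1.5, 1.7, 4.1, 4.5, 4.7, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = [grow_random_stump(values, labels) for _ in range(101)]
# The majority vote recovers class 0 for small values and class 1 for large ones
print(forest_predict(stumps, 1.2), forest_predict(stumps, 4.9))
```

No single stump is carefully crafted, yet the ensemble's vote is reliable: individual bad thresholds are outvoted.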

Another very useful property of random forests is that they give a measure of feature importance, helping us pinpoint which features significantly affect our predictions, and aid in feature selection as well.

k-means Clustering

One common goal of data analysts is to partition data into clusters, each cluster highlighting certain common traits. k-means clustering is a common machine learning method that partitions n data points (vectors) into k clusters, where each data point gets assigned to the cluster with the nearest mean. The mean of each cluster, or its centroid, serves as the prototype of the cluster. Overall, k-means clustering minimizes the variance (the squared Euclidean distances to the mean) within each cluster.

The most common algorithm for k-means clustering is iterative:

  1. Start with an initial set of k means. This means that we specify the number of clusters ahead of time, which raises the question: How to initialize it? How to select the locations of the first k centroids? There is literature on that.

  2. Assign each data point to the cluster with the nearest mean in terms of squared Euclidean distance.

  3. Recalculate the means of each cluster.

The algorithm converges when the data point assignments to each cluster do not change anymore.
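
The three steps above can be sketched as follows (a minimal one-dimensional illustration with naive random initialization; smarter initialization schemes such as k-means++ are part of the literature mentioned in step 1, and the same logic extends to vectors):

```python
import random

# Minimal k-means sketch on 1D data: assign each point to the nearest
# centroid, recompute centroids as cluster means, repeat until stable.

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialization
    while True:
        # Assignment step: each point goes to the nearest centroid
        # (squared Euclidean distance, which in 1D is a squared difference).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # assignments no longer change
            return sorted(new_centroids)
        centroids = new_centroids

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans(points, 2))  # two centroids near 1.0 and 10.0
```

Each iteration can only decrease the within-cluster variance, which is why the loop terminates.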

Performance Measures for Classification Models

It is relatively easy to develop mathematical models that compute things and produce outputs. It is a completely different story to develop models that perform well for our desired tasks. Furthermore, models that perform well according to some metrics behave badly according to some other metrics. We need extra care in developing performance metrics and deciding which ones to rely on, depending on our specific use cases.

Measuring the performance of models that predict numerical values, such as regression models, is easier than classification models, since we have many ways to compute distances between numbers (good predictions and bad predictions). On the other hand, when our task is classification (we can use models such as logistic regression, softmax regression, support vector machines, decision trees, random forests, or neural networks), we have to put some extra thought into evaluating performance. Moreover, there are usually trade-offs. For example, if our task is to classify YouTube videos as being safe for kids (positive) or not safe for kids (negative), do we tweak our model so as to reduce the number of false positives or false negatives? It is obviously more problematic if a video is classified as safe while in reality it is unsafe (false positive) than the other way around, so our performance metric needs to reflect that.

The following are the performance measures commonly used for classification models. Do not worry about memorizing their names, as the way they are named does not make logical sense. Instead, spend your time understanding their meanings:

Accuracy

Percentage of times the prediction model got the classification right:

$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all predicted positives} + \text{all predicted negatives}}$
Confusion matrix

Counting all true positives, false positives, true negatives, and false negatives:

                   Predicted negative   Predicted positive
Actual negative    True negative        False positive
Actual positive    False negative       True positive

Precision score

Accuracy of the positive predictions:

$\text{Precision} = \frac{\text{true positives}}{\text{all predicted positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$
Recall score

Ratio of the positive instances that are correctly classified:

$\text{Recall} = \frac{\text{true positives}}{\text{all positive labels}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$
Specificity

Ratio of the negative instances that are correctly classified:

$\text{Specificity} = \frac{\text{true negatives}}{\text{all negative labels}} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$
F1 score

This quantity is only high when both precision and recall scores are high:

$F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}$
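
All of the measures defined so far can be computed directly from the four confusion-matrix counts (the counts below are made up for illustration):

```python
# Computing the classification metrics above from raw confusion-matrix counts.

def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # ratio of positives caught
    specificity = tn / (tn + fp)            # ratio of negatives caught
    f1 = 2 / (1 / precision + 1 / recall)   # harmonic mean of precision and recall
    return accuracy, precision, recall, specificity, f1

# Made-up counts: 80 true positives, 20 false positives,
# 90 true negatives, 10 false negatives
acc, prec, rec, spec, f1 = metrics(tp=80, fp=20, tn=90, fn=10)
print(acc, prec, rec, spec, f1)
```

Note how the harmonic mean drags the F1 score toward the smaller of precision and recall, which is exactly why it is only high when both are.
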
AUC (area under the curve) and ROC (receiver operating characteristics) curves

These curves provide a performance measure for a classification model at various threshold values. We can use these curves to measure how well a certain variable predicts a certain outcome; for example, how well does the GRE subject test score predict passing a graduate school’s qualifying exam in the first year?

Andrew Ng’s book, Machine Learning Yearning (self-published), provides an excellent guide for performance metrics best practices. Please read carefully before diving into real AI applications, since the book’s recipes are based on many trials, successes, and failures.

Summary and Looking Ahead

In this chapter, we surveyed some of the most popular machine learning models, emphasizing a particular mathematical structure that appears throughout the book: training function, loss function, and optimization. We discussed linear, logistic, and softmax regression, then breezed over support vector machines, decision trees, ensembles, and random forests.

Moreover, we made a decent case for studying the following topics from mathematics:

Calculus

The minimum and maximum happen at the boundary or at points where the derivative is zero or does not exist.

Linear algebra
  • Linearly combining features: $\omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_n x_n$.

  • Writing various mathematical expressions using matrix and vector notation.

  • The scalar product of two vectors: $a^t b$.

  • The $l^2$ norm of a vector.

  • Avoid working with ill-conditioned matrices. Get rid of linearly dependent features. This also has to do with feature selection.

  • Avoid multiplying matrices by each other; this is too expensive. Multiply matrices by vectors instead.
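
To see why the last point matters: (A B) x and A (B x) give the same vector, but the first ordering pays for a full matrix-matrix product. A tiny pure-Python check of the equivalence (the cost comments assume n × n matrices):

```python
# For n x n matrices, computing (A B) x costs O(n^3) for the product A B,
# while A (B x) is two matrix-vector products, O(n^2) each.

def mat_vec(M, v):
    """Matrix-vector product, O(n^2)."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def mat_mat(A, B):
    """Matrix-matrix product, O(n^3)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
x = [5, 7]

slow = mat_vec(mat_mat(A, B), x)  # O(n^3) + O(n^2)
fast = mat_vec(A, mat_vec(B, x))  # O(n^2) + O(n^2)
print(slow, fast)                 # both [17, 41]
```

Matrix multiplication is associative, so we are free to pick the cheaper grouping, which is why training code multiplies matrices by vectors whenever it can.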

Optimization
  • For convex functions, we do not worry about getting stuck at local minima since local minima are also global minima. We do worry about narrow valleys (see Chapter 4).

  • Gradient descent methods need only one derivative (see Chapter 4).

  • Newton’s methods need two derivatives or an approximation of two derivatives (inconvenient for large data).

  • Quadratic programming, the dual problem, and coordinate descent (all appear in support vector machines).

Statistics
  • Correlation matrix and scatterplots

  • The F-test and mutual information test for feature selection

  • Standardizing the data features (subtracting the mean and dividing by the standard deviation)

More steps we did not and will not go over (yet):

  • Validating our models: tune the weight values and the hyperparameters so as not to overfit.

  • Test the trained model on the testing subset of the data, which our model had not used (or seen) during the training and validation steps (we do this in the accompanying Jupyter notebook).

  • Deploy and monitor the finalized model.

  • Never stop thinking on how to improve our models and how to better integrate them into the whole production pipeline.

In Chapter 4, we step into the new and exciting era of neural networks.

Chapter 4. Optimization for Neural Networks

I have lived each and every day of my life optimizing…. My first aha moment was when I learned that our brain, too, learns a model of the world.

H.

Various artificial neural networks have fully connected layers in their architecture. In this chapter, we explain how the mathematics of a fully connected neural network works and walk through an end-to-end example with a real data set. We design and experiment with various training and loss functions. We also explain that the optimization and backpropagation steps used when training neural networks are similar to how learning happens in our brains. The brain learns by reinforcing neuron connections when faced with a concept it has seen before, and weakening connections if it learns new information that contradicts previously learned concepts. Machines only understand numbers. Mathematically, stronger connections correspond to larger numbers, and weaker connections correspond to smaller numbers.

Finally, we walk through various regularization techniques, explaining their advantages, disadvantages, and use cases.

The Brain Cortex and Artificial Neural Networks

Neural networks are modeled after the brain cortex, which involves billions of neurons arranged in a layered structure. Figure 4-1 shows an image of three vertical cross-sections of the brain neocortex, and Figure 4-2 shows a diagram of a fully connected artificial neural network.

Figure 4-1. Three drawings of cortical lamination by Santiago Ramón y Cajal, from the book Comparative Study of the Sensory Areas of the Human Cortex (Andesite Press) (image source: Wikipedia)

图4-1中,每张图都显示了皮质的垂直横截面,皮质的表面(最靠近头骨的最外侧)位于顶部。左边是尼氏染色的成年人视觉皮层。中间是尼氏染色的成年人运动皮层。右边是一个半月大婴儿的高尔基染色皮质。尼氏染色显示神经元的细胞体。高尔基体染色显示神经元随机子集的树突和轴突。皮层神经元的分层结构在所有三个横截面中都很明显。

In Figure 4-1, each drawing shows a vertical cross-section of the cortex, with the surface (outermost side closest to the skull) of the cortex at the top. On the left is a Nissl-stained visual cortex of a human adult. In the middle is a Nissl-stained motor cortex of a human adult. On the right is a Golgi-stained cortex of a month-and-a-half-old infant. The Nissl stain shows the cell bodies of neurons. The Golgi stain shows the dendrites and axons of a random subset of neurons. The layered structure of the neurons in the cortex is evident in all three cross-sections.

Figure 4-2. A fully connected, or dense, artificial neural network with four layers

Even though different regions of the cortex are responsible for different functions, such as vision, auditory perception, logical thinking, language, speech, etc., what actually determines the function of a specific region are its connections: which sensory and motor skills input and output regions it connects to. This means that if a cortical region is connected to a different sensory input/output region—for example, a vision locality instead of an auditory one—then it will perform vision tasks (computations), not auditory tasks. In a very simplified sense, the cortex performs one basic function at the neuron level. In an artificial neural network, the basic computation unit is the perceptron, and it functions in the same way across the whole network. The various connections, layers, and architecture of the neural network (both the brain cortex and artificial neural networks) are what allow these computational structures to do very impressive things.

Training Function: Fully Connected, or Dense, Feed Forward Neural Networks

In a fully connected or dense artificial neural network (see Figure 4-2), every neuron, represented by a node (the circles), in every layer is connected to all the neurons in the next layer. The first layer is the input layer, the last layer is the output layer, and the intermediate layers are called hidden layers. The neural network itself, whether fully connected or not (the networks that we will encounter in the next few chapters are convolutional and are not fully connected), is a computational graph representing the formula of the training function. Recall that we use this function to make predictions after training.

Training in the neural networks context means finding the parameter values, or weights, that enter into the formula of the training function via minimizing a loss function. This is similar to training linear regression, logistic regression, softmax regression, and support vector machine models, which we discussed in Chapter 3. The mathematical structure here remains the same:

  1. Training function

  2. Loss function

  3. Optimization

The only difference is that for the simple models of Chapter 3, the formulas of the training functions are very uncomplicated. They linearly combine the data features, add a bias term ( ω 0 ), and pass the result into at most one nonlinear function (for example, the logistic function in logistic regression). As a consequence, the results of these models are also simple: a linear (flat) function for linear regression, and a linear division boundary between different classes in logistic regression, softmax regression, and support vector machines. Even when we use these simple models to represent nonlinear data, such as in the cases of polynomial regression (fitting the data into polynomial functions of the features) or support vector machines with the kernel trick, we still end up with linear functions or division boundaries, but these will either be in higher dimensions (for polynomial regression, the dimensions are the feature and its powers) or in transformed dimensions (such as when we use the kernel trick with support vector machines).

For neural network models, on the other hand, the process of linearly combining the features, adding a bias term, then passing the result through a nonlinear function (now called activation function) is the computation that happens only in one neuron. This simple process happens over and over again in dozens, hundreds, thousands, or sometimes millions of neurons, arranged in layers, where the output of one layer acts as the input of the next layer. Similar to the brain cortex, the aggregation of simple and similar processes over many neurons and layers produces, or allows for the representation of, much more complex functionalities. This is sort of miraculous. Thankfully, we are able to understand much more about artificial neural networks than our brain’s neural networks, mainly because we design them, and after all, an artificial neural network is just one mathematical function. No black box remains dark once we dissect it under the lens of mathematics. That said, the mathematical analysis of artificial neural networks is a relatively new field. There are still many questions to be answered and a lot to be discovered.

A Neural Network Is a Computational Graph Representation of the Training Function

Even for a network with only five neurons, such as the one in Figure 4-3, it is pretty messy to write the formula of the training function. This justifies the use of computational graphs to represent neural networks in an organized and easy way. Graphs are characterized by two things: nodes and edges (congratulations, this was lesson one in graph theory). In a neural network, an edge connecting node $i$ in layer $m$ to node $j$ in layer $n$ is assigned a weight $\omega_{mn,ij}$. That is four indices for only one edge! At the risk of drowning in a deep ocean of indices, we must organize a neural network’s weights in matrices.

Figure 4-3. A fully connected (or dense) feed forward neural network with only five neurons arranged in three layers. The first layer (the three black dots on the far left) is the input layer, the second layer is the only hidden layer and has three neurons, and the last layer is the output layer with two neurons.

Let’s model the training function of a feed forward fully connected neural network. Feed forward means that the information flows forward through the computational graph representing the network’s training function.

Linearly Combine, Add Bias, Then Activate

What kind of computations happen within a neuron when it receives input from other neurons? Linearly combine the input information using different weights, add a bias term, then use a nonlinear function to activate the neuron. We will go through this process one step at a time.

The weights

矩阵 1 包含与隐藏层 1相关的边的权重, 2 包含与隐藏层 2 相关的边的权重,依此类推,直到到达输出层。

Let the matrix $W_1$ contain the weights of the edges incident to hidden layer 1, $W_2$ contain the weights of the edges incident to hidden layer 2, and so on, until we reach the output layer.

So for the small neural network represented in Figure 4-3, we only have h = 1 hidden layer, obtaining two matrices of weights:

$W_1 = \begin{pmatrix} \omega_{11}^1 & \omega_{12}^1 & \omega_{13}^1 \\ \omega_{21}^1 & \omega_{22}^1 & \omega_{23}^1 \\ \omega_{31}^1 & \omega_{32}^1 & \omega_{33}^1 \end{pmatrix}, \quad W_{h+1} = W_2 = W_{output} = \begin{pmatrix} \omega_{11}^2 & \omega_{12}^2 & \omega_{13}^2 \\ \omega_{21}^2 & \omega_{22}^2 & \omega_{23}^2 \end{pmatrix},$

where the superscripts indicate the layer to which the edges point. Note that if we only had one node at the output layer instead of two, then the last matrix of weights $W_{h+1} = W_{output}$ would only be a row vector:

$W_{h+1} = W_2 = W_{output} = \begin{pmatrix} \omega_{11}^2 & \omega_{12}^2 & \omega_{13}^2 \end{pmatrix}$

Now at one node of this neural network, two computations take place:

  1. A linear combination plus bias

  2. Passing the result through a nonlinear activation function (the composition operation from calculus)

We elaborate on these two, then ultimately construct the training function of the fully connected feed forward neural network represented in Figure 4-3.

A linear combination plus bias

At the first node in the first hidden layer (the only hidden layer for this small network), we linearly combine the inputs:

$z_1^1 = \omega_{11}^1 x_1 + \omega_{12}^1 x_2 + \omega_{13}^1 x_3 + \omega_{01}^1$

At the second node in the first hidden layer, we linearly combine the inputs using different weights than the previous linear combination:

$z_2^1 = \omega_{21}^1 x_1 + \omega_{22}^1 x_2 + \omega_{23}^1 x_3 + \omega_{02}^1$

At the third node in the first hidden layer, we linearly combine the inputs using different weights than the previous two linear combinations:

$z_3^1 = \omega_{31}^1 x_1 + \omega_{32}^1 x_2 + \omega_{33}^1 x_3 + \omega_{03}^1$

Let’s express the three equations above using vector and matrix notation. This will be extremely convenient for our optimization task later, and of course it will preserve our sanity:

$\begin{pmatrix} z_1^1 \\ z_2^1 \\ z_3^1 \end{pmatrix} = \begin{pmatrix} \omega_{11}^1 \\ \omega_{21}^1 \\ \omega_{31}^1 \end{pmatrix} x_1 + \begin{pmatrix} \omega_{12}^1 \\ \omega_{22}^1 \\ \omega_{32}^1 \end{pmatrix} x_2 + \begin{pmatrix} \omega_{13}^1 \\ \omega_{23}^1 \\ \omega_{33}^1 \end{pmatrix} x_3 + \begin{pmatrix} \omega_{01}^1 \\ \omega_{02}^1 \\ \omega_{03}^1 \end{pmatrix} = \begin{pmatrix} \omega_{11}^1 & \omega_{12}^1 & \omega_{13}^1 \\ \omega_{21}^1 & \omega_{22}^1 & \omega_{23}^1 \\ \omega_{31}^1 & \omega_{32}^1 & \omega_{33}^1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} + \begin{pmatrix} \omega_{01}^1 \\ \omega_{02}^1 \\ \omega_{03}^1 \end{pmatrix}$

We can now summarize the above expression compactly as:

$z^1 = W_1 x + \omega_0^1$

Pass the result through a nonlinear activation function

Linearly combining the features and adding bias are not enough to pick up on more complex information in the data, and neural networks would have never been successful without this crucial but very simple step: compose with a nonlinear function at each node of the hidden layers.

We are the ones who decide on the formula for the nonlinear activation function, and different nodes can have different activation functions, even though it is rare to do this in practice. Let f be this activation function, then the output of the first hidden layer will be:

$s^1 = f(z^1) = f(W_1 x + \omega_0^1)$

It is now straightforward to see that if we had more hidden layers, their outputs will be chained with those of previous layers, making writing the training function a bit tedious:

$s^2 = f(z^2) = f(W_2 s^1 + \omega_0^2) = f(W_2 f(W_1 x + \omega_0^1) + \omega_0^2)$
$s^3 = f(z^3) = f(W_3 s^2 + \omega_0^3) = f(W_3 f(W_2 f(W_1 x + \omega_0^1) + \omega_0^2) + \omega_0^3)$

This chaining goes on until we reach the output layer. What happens at this very last layer depends on the task of the network. If the goal is regression (predict one numerical value) or binary classification (classify into two classes), then we only have one output node (see Figure 4-4).

Figure 4-4. A fully connected (or dense) feed forward neural network with only nine neurons arranged in four layers. The first layer on the far left is the input layer, the second and third layers are two hidden layers with four neurons each, and the last layer is the output layer with only one neuron (this network performs either a regression or a binary classification task).
  • If the task is regression, we linearly combine the outputs of the previous layer at the final output node, add bias, and go home (we do not pass the result through a nonlinear function in this case). Since the output layer only has one node, the output matrix is just a row vector W output = W h+1 , and one bias ω 0 h+1 . The prediction of the network will now be:

    $y_{\text{predict}} = W^{h+1} s^h + \omega_0^{h+1}$

    where h is the total number of hidden layers in the network (this does not include the input and output layers).

  • If, on the other hand, the task is binary classification, then again we have only one output node, where we linearly combine the outputs of the previous layer, add bias, then pass the result through the logistic function $\sigma(s) = \frac{1}{1+e^{-s}}$, resulting in the network’s prediction:

    $y_{\text{predict}} = \sigma(W^{h+1} s^h + \omega_0^{h+1})$
  • If the task is to classify into multiple classes, say, five classes, then the output layer would include five nodes. At each of these nodes, we linearly combine the outputs of the previous layer, add bias, then pass the result through the softmax function:

    $\sigma(z_j) = \frac{e^{z_j}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4} + e^{z_5}}$ for $j = 1, 2, \ldots, 5$

Group those into a vector function σ that also takes vectors as input: σ ( z ) , then the final prediction of the neural network is a vector of five probability scores where a data instance belongs to each of the five classes:

$y_{\text{predict}} = \sigma(z) = \sigma(W^{h+1} s^h + \omega_0^{h+1})$
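The three output-layer cases above can be sketched in a few lines of NumPy. The hidden-layer outputs and weights below are hypothetical placeholders, not values from the book:

```python
import numpy as np

s_h = np.array([0.2, 1.5, -0.3, 0.8])          # outputs of the last hidden layer

# Regression: one node, linear combination plus bias, no activation.
w_reg, b_reg = np.array([0.1, -0.4, 0.25, 0.7]), 0.05
y_regression = w_reg @ s_h + b_reg

# Binary classification: the same linear step passed through the logistic function.
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
y_binary = sigma(w_reg @ s_h + b_reg)          # a probability in (0, 1)

# Five-class classification: a 5x4 output weight matrix, then softmax.
W_out, b_out = np.arange(20).reshape(5, 4) * 0.05, np.zeros(5)
z = W_out @ s_h + b_out
z_shift = z - z.max()                          # shift for numerical stability
softmax = np.exp(z_shift) / np.exp(z_shift).sum()
print(softmax.sum())                           # probabilities sum to 1
```

The softmax output is a vector of five probability scores, one per class, matching the formula above.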

Notation overview

We will try to remain consistent with notation throughout our discussion of neural networks. The x’s are the input features, the W’s are the matrices or column vectors containing the weights that we use for linear combinations, the ω 0 ’s are the biases that are sometimes grouped into a vector, the z’s are the results of linear combinations plus biases, and the s’s are the results of passing the z’s into the nonlinear activation functions.

Common Activation Functions

In theory, we can use any nonlinear function to activate our nodes (think of all the calculus functions we’ve ever encountered). In practice, there are some popular ones, listed next and graphed in Figure 4-5.

By far the Rectified Linear Unit function (ReLU) is the most commonly used in today’s networks, and the success of AlexNet in 2012 is partially attributed to the use of this activation function, as opposed to the hyperbolic tangent and logistic functions (sigmoid) that were commonly used in neural networks at the time (and are still in use).

The first four functions in the following list and in Figure 4-5 are all inspired by computational neuroscience, where they attempt to model a threshold for the activation (firing) of one neuron cell. Their graphs look similar to each other: some are smoother variants of others, some output only positive numbers, others output more balanced numbers between –1 and 1, or between $-\frac{\pi}{2}$ and $\frac{\pi}{2}$. They all saturate for small or large inputs, meaning their graphs become flat for inputs large in magnitude. This creates a problem for learning, since if these functions output the same numbers over and over again, there will not be much learning happening.

Mathematically, this phenomenon manifests itself as the vanishing gradient problem. The second set of activation functions attempts to rectify this saturation problem, which it does, as we see in the graphs of the second row in Figure 4-5. This, however, introduces another problem, called the exploding gradient problem, since these activation functions are unbounded and can now output big numbers, and if these numbers grow over multiple layers, we have a problem.

Figure 4-5. Various activation functions for neural networks. The first row consists of sigmoid-type activation functions, shaped like the letter S. These activation functions saturate (become flat and output the same values) for inputs that are large in magnitude. The second row consists of ReLU-type activation functions, which do not saturate. An engineer once pointed out the analogy between these activation functions and the physical function of transistors.

Every new set of problems that gets introduced comes with its own set of techniques attempting to fix it, such as gradient clipping, normalizing the outputs after each layer, etc. The take-home lesson is that none of this is magic. A lot of it is trial and error, and new methods emerge to fix problems that other new methods introduced. We only need to understand the principles, the why and the how, and get a decent exposure to what is popular in the field, while keeping an open mind for improving things, or doing things entirely differently.

Let’s state the formulas of common activation functions, as well as their derivatives. We need to calculate one derivative of the training function when we optimize the loss function in our search for the best weights of the neural network:

  • Step function: $f(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

    Its derivative: $f'(z) = \begin{cases} 0 & \text{if } z \neq 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$

  • Logistic function: $\sigma(z) = \frac{1}{1+e^{-z}}$.

    Its derivative: $\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma(z)(1 - \sigma(z))$.

  • Hyperbolic tangent function: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{2}{1+e^{-2z}} - 1$

    Its derivative: $\tanh'(z) = \frac{4}{(e^z + e^{-z})^2} = 1 - \tanh(z)^2$

  • Inverse tangent function: $f(z) = \arctan(z)$.

    Its derivative: $f'(z) = \frac{1}{1+z^2}$.

  • Rectified Linear Unit function or ReLU(z): $f(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases}$

    Its derivative: $f'(z) = \begin{cases} 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \\ 1 & \text{if } z > 0 \end{cases}$

  • Leaky Rectified Linear Unit function (or parametric linear unit): $f(z) = \begin{cases} \alpha z & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases}$

    Its derivative: $f'(z) = \begin{cases} \alpha & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \\ 1 & \text{if } z > 0 \end{cases}$

  • Exponential Linear Unit function: $f(z) = \begin{cases} \alpha(e^z - 1) & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases}$

    Its derivative: $f'(z) = \begin{cases} f(z) + \alpha & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

  • Softplus function: $f(z) = \ln(1 + e^z)$

    Its derivative: $f'(z) = \frac{1}{1+e^{-z}} = \sigma(z)$
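The derivative formulas in this list are easy to check numerically. The following sketch implements a few of the activations and verifies the identities $\sigma' = \sigma(1-\sigma)$, $\tanh' = 1 - \tanh^2$, and $\text{softplus}' = \sigma$ against a central finite-difference approximation:

```python
import numpy as np

def sigma(z):    return 1.0 / (1.0 + np.exp(-z))            # logistic
def relu(z):     return np.maximum(0.0, z)                   # ReLU
def softplus(z): return np.log1p(np.exp(z))                  # ln(1 + e^z)

z = np.linspace(-3, 3, 7)
h = 1e-6                                                     # finite-difference step

num_d = (sigma(z + h) - sigma(z - h)) / (2 * h)
assert np.allclose(num_d, sigma(z) * (1 - sigma(z)), atol=1e-6)   # sigma' = sigma(1 - sigma)

num_d = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)
assert np.allclose(num_d, 1 - np.tanh(z) ** 2, atol=1e-6)         # tanh' = 1 - tanh^2

num_d = (softplus(z + h) - softplus(z - h)) / (2 * h)
assert np.allclose(num_d, sigma(z), atol=1e-6)                    # softplus' = sigma
```

Every assertion passes, confirming the closed-form derivatives listed above.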

Note that all of these activations are rather elementary functions. This is a good thing, since both they and their derivatives are usually involved in massive computations with thousands of parameters (weights) and data instances during training, testing, and deployment of neural networks, so better keep them elementary. The other reason is that in theory, it doesn’t really matter what activation function we end up choosing because of the universal function approximation theorems, discussed next. Careful here: operationally, it definitely matters what activation function we choose for our neural network nodes. As we mentioned earlier in this section, AlexNet’s success in image classification tasks is partly due to its use of the Rectified Linear Unit function, ReLU(z). Theory and practice do not contradict each other in this case, even though it seems so on the surface. We explain this in the next subsection.

Universal Function Approximation

Approximation theorems, when available, are awesome because they tell us, with mathematical confidence and authority, that if we have a function that we do not know, or that we know but that is difficult to include in our computations, then we do not have to deal with this unknown or difficult function altogether. We can, instead, approximate it using known functions that are much easier to compute, to a great degree of precision. This means that under certain conditions on both the unknown or complicated function, and the known and simple (sometimes elementary) functions, we can use the simple functions and be confident that our computations are doing the right thing. These types of approximation theorems quantify how far off the true function is from its approximation, so we know exactly how much error we are committing when substituting the true function with this approximation.

The fact that neural networks, even sometimes nondeep neural networks with only one hidden layer, have proved so successful for accomplishing various tasks in vision, speech recognition, classification, regression, and others means that they have some universal approximation property going on for them. The training function that a neural network represents (built from elementary linear combinations, biases, and very simple activation functions) approximates the underlying unknown function that truly represents or generates the data rather well.

The natural questions that mathematicians must now answer with a theorem, or a bunch of theorems, are:

Given some function that we don’t know but we really care about (because we think it is the true function underlying or generating our data), is there a neural network that can approximate it to a good degree of precision (without ever having to know this true function)?

Practice using neural networks successfully suggests that the answer is yes, and universal approximation theorems for neural networks prove that the answer is yes for a certain class of functions and networks.

If there is a neural network that approximates this true and elusive data generating function, how do we construct it? How many layers should it have? How many nodes in each layer? What type of activation function should it include?

In other words, what is the architecture of this network? Sadly, as of now, little is known on how to construct these networks, and experimentation with various architectures and activations is the only way forward until more mathematicians get on this.

Are there multiple neural network architectures that work well? Are there some that are better than others?

Experiments suggest that the answer is yes, given the comparable performance of various architectures on the same tasks and data sets.

Note that having definite answers for these questions is very useful. An affirmative answer to the first question tells us: hey, there is no magic here, neural networks do approximate a wide class of functions rather well! This wide coverage, or universality, is crucial, because recall that we do not know the underlying generating function of the data, but if the approximation theorem covers a wide class of functions, our unknown and elusive function might as well be included, hence the success of the neural network. Answering the second and third sets of questions is even more useful for practical applications, because if we know which architecture works best for each task type and data set, then we would be saved from so much experimentation, and we’d immediately choose an architecture that performs well.

Before stating the universal approximation theorems for neural networks and discussing their proofs, let’s go over two examples where we already encountered approximation type theorems, even when we were in middle school. The same principle applies for all examples: we have an unruly quantity that for whatever reason is difficult to deal with or is unknown, and we want to approximate it using another quantity that is easier to deal with. If we want universal results, we need to specify three things:

  1. What class or what kind of space does the unruly quantity or function belong to? Is it the set of real numbers $\mathbb{R}$? The set of irrational numbers? The space of continuous functions on an interval? The space of compactly supported functions on $\mathbb{R}$? The space of Lebesgue measurable functions (I did slide in some measure theory stuff in here, hoping that no one notices or runs away)? Etc.

  2. What kind of easier quantities or functions are we using to approximate the unruly entities, and how does using these quantities instead of the true function benefit us? How do these approximations fare against other approximations, if there are already some other popular approximations?

  3. In what sense is the approximation happening, meaning that when we say we can approximate $f_{\text{true}}$ using $f_{\text{approximate}}$, how exactly are we measuring the distance between $f_{\text{true}}$ and $f_{\text{approximate}}$? Recall that in math we can measure sizes of objects, including distances, in many ways. So exactly which way are we using for our particular approximations? This is where we hear about the Euclidean norm, uniform norm, supremum norm, $L^2$ norm, etc. What do norms (sizes) have to do with distances? A norm induces a distance. This is intuitive: if our space allows us to talk about sizes of objects, then it better allow us to talk about distances as well.

Example 1: Approximating irrational numbers with rational numbers

Any irrational number can be approximated by a rational number, up to any precision that we desire. Rational numbers are so well-behaved and useful, since they are just pairs of whole numbers. Our minds can easily intuit about whole numbers and fractions. Irrational numbers are quite the opposite. Have you ever been asked, maybe in grade 6, to calculate $\sqrt{47} = 6.8556546\ldots$ without a calculator, and stay at it until you had a definite answer? I have. Pretty mean! Even calculators and computers approximate irrational numbers with rationals. But I had to sit there thinking I could keep writing digits until I either found a pattern or the computation terminated. Of course neither happened, and around 30 digits later, I learned that some numbers are just irrational.

There is more than one way to write a mathematical statement quantifying this approximation. They are all equivalent and useful:

The approximating entity can be made arbitrarily close to the true quantity

This is the most intuitive way.

Given an irrational number s and any precision ϵ , no matter how small, we can find a rational number q within a distance ϵ from s:

| s - q | < ϵ

This means that rational and irrational numbers live arbitrarily close to each other on the real line . This introduces the idea of denseness.

Denseness and closure

Approximating entities are dense in the space where the true quantities live.

This means that if we focus only on the space of approximating members, then add in all the limits of all their sequences, we get the whole space of the true members. Adding in all the limiting points of a certain space S is called closing the space, or taking its closure, $\overline{S}$. For example, when we add to the open interval (a,b) its limit points a and b, we get the closed interval [a,b]. Thus the closure of (a,b) is [a,b]. We write $\overline{(a,b)} = [a,b]$.

The set of rational numbers $\mathbb{Q}$ is dense in the real line $\mathbb{R}$. In other words, the closure of $\mathbb{Q}$ is $\mathbb{R}$. We write $\overline{\mathbb{Q}} = \mathbb{R}$.

Limits of sequences

The true quantity is the limit of a sequence of the approximating entities.

The idea of adding in the limit points in the previous bullet introduces approximation using the terminology of sequences and their limits.

In the context of rational numbers approximating irrational numbers, we can therefore write: for any irrational number s, there is a sequence $q_n$ of rational numbers such that $\lim_{n\to\infty} q_n = s$. This gives us the chance to write as an example one of the favorite definitions of the most famous irrational number: e = 2.71828182…

$\lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n = e$

That is, the irrational number e is the limit of the sequence of rational numbers $\left(1+\frac{1}{1}\right)^1, \left(1+\frac{1}{2}\right)^2, \left(1+\frac{1}{3}\right)^3, \ldots$, which is equivalent to $2, 2.25, 2.370370\ldots$

Whether we approximate an irrational number with a rational number using the arbitrarily close concept, the denseness and closure concepts, or the limits of sequences concept, any distance involved in the mathematical statements is measured using the usual Euclidean norm: d ( s , q ) = | s - q | , which is the normal distance between two numbers.
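We can watch this convergence with exact rational arithmetic: each term $(1 + \frac{1}{n})^n$ is a genuine rational number (Python's Fraction keeps it as a pair of whole numbers), and the terms approach e:

```python
from fractions import Fraction
import math

# Each term (1 + 1/n)^n is an exact rational number; the sequence converges to e.
for n in (1, 10, 100, 10000):
    q = (Fraction(1) + Fraction(1, n)) ** n
    print(n, float(q))

# For n = 10000 the rational term is already within about 1.4e-4 of e.
q = (Fraction(1) + Fraction(1, 10000)) ** 10000
assert abs(float(q) - math.e) < 1e-3
```

The error of the nth term behaves roughly like $\frac{e}{2n}$, so the sequence converges, though slowly, to the irrational limit e.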

Closeness Statements Need to Be Accompanied by a Specific Norm

We might wonder: what if we change the norm? Would the approximation property still hold? Can we still approximate irrationals using rationals if we measure the distance between them using some other definition of distance than the usual Euclidean norm? Welcome to mathematical analysis. In general, the answer is no. Quantities can be close to each other using some norm and very far using another norm. So in mathematics, when we say that quantities are close to each other, approximate others, or converge somewhere, we need to mention the accompanying norm in order to pinpoint in what sense these closeness statements are happening.

Example 2: Approximating continuous functions with polynomials

Continuous functions can be anything. A child can draw a wiggly line on a piece of paper and that would be a continuous function that no one knows the formula of. Polynomials, on the other hand, are a special type of continuous function that are extremely easy to evaluate, differentiate, integrate, explain, and do computations with. The only operations involved in polynomial functions are powers, scalar multiplication, addition, and subtraction. A polynomial of degree n has a simple formula:

$p_n(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_n x^n$

where the $a_i$’s are scalar numbers. Naturally, it is extremely desirable to be able to approximate nonpolynomial continuous functions using polynomial functions. The wonderful news is that we can, up to any precision $\epsilon$. This is a classical result in mathematical analysis, called the Weierstrass Approximation Theorem:

Suppose f is a continuous real-valued function defined on a real interval [a,b]. For any precision $\epsilon > 0$, there exists a polynomial $p_n$ such that for all x in [a,b], we have $|f(x) - p_n(x)| < \epsilon$, or equivalently, the supremum norm $\|f - p_n\|_\infty < \epsilon$.

Note that the same principle as the one that we discussed for using rational numbers to approximate irrationals applies here. The theorem asserts that we can always find polynomials that are arbitrarily close to a continuous function, which means that the set of polynomials is dense in the space of continuous functions over the interval [a,b]; or equivalently, for any continuous function f, we can find a sequence of polynomial functions that converge to f (so f is the limit of a sequence of polynomials). In all of these variations of the same fact, the distances are measured with respect to the supremum norm. In Figure 4-6, we verify that the continuous function $\sin x$ is the limit of the sequence of polynomial functions $\{x,\; x - \frac{x^3}{3!},\; x - \frac{x^3}{3!} + \frac{x^5}{5!},\; \ldots\}$.

Figure 4-6. Approximating the continuous function $\sin x$ by a sequence of polynomials
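The polynomial sequence from Figure 4-6 is easy to reproduce: the partial sums $x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$ get arbitrarily close to $\sin x$ as we add terms:

```python
import math

def sin_taylor(x, terms):
    """Partial sum x - x^3/3! + x^5/5! - ... with the given number of terms."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

x = 1.2
for terms in (1, 2, 3, 6):
    print(terms, sin_taylor(x, terms), math.sin(x))

# With 6 terms the polynomial already agrees with sin(1.2) to about 9 digits.
assert abs(sin_taylor(x, 6) - math.sin(x)) < 1e-8
```

Each partial sum is a polynomial, so this is one explicit instance of a sequence of polynomials converging to a nonpolynomial continuous function.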

Statement of the universal approximation theorem for neural networks

Now that we understand the principles of approximation, let’s state the most recent approximation theorems for neural networks.

Recall that a neural network is the representation of the training function as a computational graph. We want this training function to approximate the unknown function that generates the data well. This allows us to use the training function instead of the underlying true function, which we do not know and probably will never know, to make predictions. The following approximation theorems assert that neural networks can approximate the underlying functions up to any precision. When we compare the statements of these theorems to the two previous examples on irrational numbers and continuous functions, we notice that they are the same kind of mathematical statements.

The following result is from Hornik, Stinchcombe, and White (1989): let f be a continuous function on a compact set K (this is the true but unknown function underlying the data) whose outputs are in $\mathbb{R}^d$. Then:

Arbitrarily close

There exists a feed forward neural network, having only a single hidden layer, which uniformly approximates f to within an arbitrary ϵ > 0 on K.

Denseness

The set of neural networks, with prescribed nonlinear activations and bounds on the number of neurons and layers depending on d, is dense in the uniform topology of $C(K, \mathbb{R}^d)$.

In both variations of the same fact, the distances are measured with respect to the supremum norm on continuous functions.
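A toy instance of this approximation power, with hand-picked weights for illustration: a feed forward network with a single hidden layer of two ReLU nodes represents the continuous function |x| exactly, since relu(x) + relu(−x) = |x|. Wider hidden layers stitch many such pieces together to approximate general continuous functions on a compact set:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One hidden layer with two nodes: hidden weights [1, -1], no biases,
# output weights [1, 1]. The network computes relu(x) + relu(-x) = |x|.
W1 = np.array([[1.0], [-1.0]])     # hidden-layer weight matrix (2 nodes, 1 input)
w_out = np.array([1.0, 1.0])       # output-layer weights

x = np.linspace(-2, 2, 9).reshape(1, -1)   # a batch of inputs
y = w_out @ relu(W1 @ x)
assert np.allclose(y, np.abs(x))           # exact match with |x|
```

Piecewise-linear building blocks like this one are the mechanism behind the density statement: enough ReLU nodes can trace any continuous function to within any $\epsilon$ on a compact set.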

The proof needs mathematical concepts from measure theory and functional analysis. We will introduce measure theory in Chapter 11 on probability. For now we only list what is needed for the proof without any details: Borel and Radon measures, Hahn-Banach theorem, and Riesz representation theorem.

Approximation Theory for Deep Learning

We only motivated approximation theory and stated one of its main results for deep learning. For more information and a deeper discussion, we point to the state-of-the-art results such as the ability of neural networks to learn probability distributions, Barron’s theorem, the neural tangent kernel, and others.

Loss Functions

Even though in this chapter we transitioned from Chapter 3’s traditional machine learning to the era of deep learning, the structure of the training function, loss function, and optimization is still exactly the same. The loss functions used for neural networks are not different from those discussed in Chapter 3, since the goal of a loss function has not changed: to capture the error between the ground truth and prediction made by the training function. In deep learning, the neural network represents the training function, and for feed forward neural networks, we saw that this is nothing more than a sequence of linear combinations followed by compositions with nonlinear activation functions.

The most popular loss functions used in deep learning are still the mean squared error for regression tasks and the cross-entropy function for classification tasks. Go back to Chapter 3 for a thorough explanation of these functions.
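As a quick reminder of what these two loss functions compute, here is a minimal sketch with made-up numbers:

```python
import numpy as np

# Mean squared error for a regression task.
y_true = np.array([3.0, -0.5, 2.0])   # ground truth
y_pred = np.array([2.5,  0.0, 2.0])   # training-function predictions
mse = np.mean((y_true - y_pred) ** 2)

# Cross-entropy for a classification task: one-hot label vs predicted probabilities.
p_true = np.array([0.0, 1.0, 0.0])
p_pred = np.array([0.1, 0.8, 0.1])
cross_entropy = -np.sum(p_true * np.log(p_pred))   # = -log(0.8)
print(mse, cross_entropy)
```

Both losses are small when the predictions agree with the ground truth and grow as the predictions drift away from it.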

There are other loss functions that we sometimes come across in the field. When we encounter a new loss function, usually the designers of the model have a certain reason to prefer it over the other more popular ones, so make sure you go through their rationale for using that specific loss function for their particular setup. Ideally, a good loss function penalizes bad predictions, is not expensive to compute, and has one derivative that is easy to compute. We need this derivative so that our optimization method behaves well. As we discussed in Chapter 3, functions with one good derivative have smoother terrains than functions with discontinuous derivatives, and hence are easier to navigate during the optimization process when searching for minimizers of the loss function.

Cross-Entropy Function, log-likelihood Function, and KL Divergence

Minimizing the cross-entropy loss function is the same as maximizing the log-likelihood function; KL divergence for probability distributions is closely related. Recall that the cross-entropy function is borrowed from information theory and statistical mechanics, and it quantifies the cross entropy between the true (empirical) distribution of the data and the distribution (of predictions) produced by the neural network’s training function. The cross-entropy function has a negative sign and a log function in its formula. Minimizing the minus of a function is the same as maximizing the same function without the minus sign, so sometimes you encounter the following statement in the field: maximizing the log likelihood function, which for us is equivalent to minimizing the cross-entropy loss function. A closely related concept is the Kullback-Leibler divergence, also called KL divergence. Sometimes, as in the cases where we generate images or machine audio, we need to learn a probability distribution, not a deterministic function. Our loss function in this case should capture the difference (I will not say distance since its mathematical formula is not a distance metric) between the true probability distribution of the data and the learned probability distribution. KL divergence is an example of such a loss function that quantifies the amount of information lost when the learned distribution is used to approximate the true distribution, or the relative entropy of the true distribution with respect to the learned distribution.
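The relationship between these quantities can be verified numerically: cross entropy decomposes as H(p) + KL(p‖q), so minimizing cross entropy over the learned distribution q is the same as minimizing the KL divergence, since the entropy H(p) of the true distribution is fixed by the data. A sketch with hypothetical distributions p and q:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # true (empirical) distribution, hypothetical
q = np.array([0.1, 0.7, 0.2])   # learned distribution, hypothetical

cross_entropy = -np.sum(p * np.log(q))
entropy       = -np.sum(p * np.log(p))
kl            =  np.sum(p * np.log(p / q))

# Cross entropy = H(p) + KL(p || q); KL is nonnegative and is zero only when p = q.
assert np.isclose(cross_entropy, entropy + kl)
assert kl >= 0
```

The KL divergence quantifies the information lost when q is used in place of p, which is exactly why it serves as a loss function for learning probability distributions.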

Optimization

Faithful to our training function, loss function, and optimization mathematical structure, we now discuss the optimization step. Our goal is to perform an efficient search of the landscape of the loss function L ( ω ) to find the minimizing ω ’s. Note that when we previously explicitly wrote formulas for the training functions of the neural network, we bundled up the ω weights in matrices W, and the biases in vectors ω 0 . In this section, for the sake of simplifying notation and to keep the focus on the mathematics, we put all the weights and biases in one very long vector ω . That is, we write the loss function as L ( ω ) , while in reality, for a fully connected neural network with h hidden layers, it is:

$\text{Loss function} = L(W^1, \omega_0^1, W^2, \omega_0^2, \ldots, W^{h+1}, \omega_0^{h+1})$

We only need that representation when we explicitly compute the derivative of the loss function using backpropagation, which we will cover later in this chapter.

For deep learning, the number of ω ’s in the vector ω can be extremely high, as in tens of thousands, millions, or even billions. OpenAI’s GPT-2 for natural language has 1.5 billion parameters, and was trained on a data set of eight million web pages. We need to solve for these many unknowns! Think parallel computing, or mathematical and algorithmic pipelining.

Using optimization methods, such as Newton-type methods that require computing matrices of second derivatives of the loss function in that many unknowns, is simply unfeasible even with our current powerful computational abilities. This is a great example where the mathematical theory of a numerical method works perfectly fine but is impractical for computational and real-life implementation. The sad part here is that numerical optimization methods that use the second derivative usually converge faster than those that use only the first derivative, because they take advantage of the extra knowledge about the concavity of the function (the shape of its bowl), as opposed to only using the information on whether the function is increasing or decreasing that the first derivative provides. Until we invent even more powerful computers, we have to satisfy ourselves with first-order methods that use only one derivative of the loss function with respect to the unknown ω ’s. These are the gradient-descent-type methods, and luckily, they perform extremely well for many real-life AI systems that are currently deployed for use in our everyday life, such as Amazon’s Alexa.

Mathematics and the Mysterious Success of Neural Networks

It is worth pausing here to reflect on the success of neural networks, which in the context of this section translates to: our ability to locate a minimizer for the loss function L ( ω ) that makes the training function generalize to new and unseen data really well. I do not have a North American accent, and Amazon’s Alexa understands me perfectly fine. Mathematically, this success of neural networks is still puzzling for various reasons:

  • The loss function’s ω -domain L ( ω ) , where the search for the minimum is happening, is very high-dimensional (can reach billions of dimensions). We have billions or even trillions of options. How are we finding the right one?

  • The landscape of the loss function itself is nonconvex, so it has a bunch of local minima and saddle points where optimization methods can get stuck or converge to the wrong local minimum. Again, how are we finding the right one?

  • In some AI applications, such as computer vision, there are many more ω ’s than data points (images). Recall that for images, each pixel is a feature, so that is already a lot of ω ’s at the input level alone. For such applications, there are many more unknowns (the ω ’s) than there is information to determine them (the data points). Mathematically, this is an underdetermined system, and such systems have infinitely many possible solutions! So exactly how is the optimization method for our network picking out the good solutions? The ones that generalize well?

Some of this mysterious success is attributed to techniques that have become a staple during the training process, such as regularization (discussed later in this chapter), validation, testing, etc. However, deep learning still lacks a solid theoretical foundation. This is why many mathematicians have recently converged on such questions. The efforts of the National Science Foundation (NSF) in this direction, and the quotes that we copy next from its announcements, are quite informative and give great insight into how mathematics is intertwined with advancing AI:

The NSF has recently established 11 new artificial intelligence research institutes to advance AI in various fields, such as human-AI interaction and collaboration, AI for advances in optimization, AI and advanced cyberinfrastructure, AI in computer and network systems, AI in dynamic systems, AI-augmented learning, and AI-driven innovation in agriculture and the food system. The NSF’s ability to bring together numerous fields of scientific inquiry, including computer and information science and engineering, along with cognitive science and psychology, economics and game theory, engineering and control theory, ethics, linguistics, mathematics, and philosophy, uniquely positions the agency to lead the nation in expanding the frontiers of AI. NSF funding will help the U.S. capitalize on the full potential of AI to strengthen the economy, advance job growth, and bring benefits to society for decades to come.

The following is quoted from the NSF’s Mathematical and Scientific Foundations of Deep Learning (SCALE MoDL) webcast.

Deep learning has met with impressive empirical success that has fueled fundamental scientific discoveries and transformed numerous application domains of artificial intelligence. Our incomplete theoretical understanding of the field, however, impedes accessibility to deep learning technology by a wider range of participants. Confronting our incomplete understanding of the mechanisms underlying the success of deep learning should serve to overcome its limitations and expand its applicability. The SCALE MoDL program will sponsor new research collaborations consisting of mathematicians, statisticians, electrical engineers, and computer scientists. Research activities should be focused on explicit topics involving some of the most challenging theoretical questions in the general area of Mathematical and Scientific Foundations of Deep Learning. Each collaboration should conduct training through research involvement of recent doctoral degree recipients, graduate students, and/or undergraduate students from across this multidisciplinary spectrum. A wide range of scientific themes on theoretical foundations of deep learning may be addressed in these proposals. Likely topics include but are not limited to geometric, topological, Bayesian, or game-theoretic formulations, to analysis approaches exploiting optimal transport theory, optimization theory, approximation theory, information theory, dynamical systems, partial differential equations, or mean field theory, to application-inspired viewpoints exploring efficient training with small data sets, adversarial learning, and closing the decision-action loop, not to mention foundational work on understanding success metrics, privacy safeguards, causal inference, and algorithmic fairness.

Gradient Descent: ω_{i+1} = ω_i - η ∇L(ω_i)

The widely used gradient descent method for optimization in deep learning is so simple that we could fit its formula in this subsection’s title. This is how gradient descent searches the landscape of the loss function L ( ω ) for a local minimum:

Initialize somewhere at ω 0

Randomly pick starting numerical values for ω 0 = ( ω 0 , ω 1 , , ω n ) . This choice places us somewhere in the search space and at the landscape of L ( ω ) . One big warning here: where we start matters! Do not initialize with all zeros or all equal numbers. This will diminish the network’s ability to learn different features, since different nodes will output exactly the same numbers. We will discuss initialization shortly.

Move to a new point ω 1

The gradient descent moves in the direction opposite to the gradient vector of the loss function, -∇L(ω_0). This is guaranteed to decrease the loss if the step size η, also called the learning rate, is not too large:

ω_1 = ω_0 - η ∇L(ω_0)
Move to a new point ω 2

Again, the gradient descent moves in the direction opposite to the gradient vector of the loss function, -∇L(ω_1). This is guaranteed to decrease the loss if the learning rate η is not too large:

ω_2 = ω_1 - η ∇L(ω_1)
Keep going until the sequence of points { ω 0 , ω 1 , ω 2 , } converges

Note that in practice, we sometimes have to stop before it is clear that this sequence has converged—for example, when it becomes painfully slow due to a flattening landscape.
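
The steps above can be sketched in a few lines of code. Here is a minimal gradient descent loop on a toy convex loss L(ω) = (ω_1 - 1)² + (ω_2 + 2)², whose gradient we can write by hand (the loss, starting point, and learning rate are illustrative choices, not an example from the text):

```python
import numpy as np

def loss(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    # Gradient of the loss: (2(w1 - 1), 2(w2 + 2)).
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

eta = 0.1                    # learning rate
w = np.array([5.0, 5.0])     # initialize somewhere (not all zeros in a real network)
for _ in range(200):
    w = w - eta * grad(w)    # step in the direction opposite to the gradient

# The iterates converge to the unique minimizer (1, -2).
```

On this convex bowl the starting point does not matter; the whole difficulty discussed above comes from nonconvex landscapes, where it does.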

Figure 4-7 shows minimizing a certain loss function L(ω_1, ω_2) using gradient descent. We as humans are limited to the three-dimensional space that we exist in, so we cannot visualize beyond three dimensions. This is a severe limitation for us in terms of visualization, since our loss functions usually act on very high-dimensional spaces. They are functions of many ω ’s, but we can only visualize them accurately if they depend on at most two ω ’s. That is, we can visualize a loss function L(ω_1, ω_2) depending on two ω ’s, but not a loss function L(ω_1, ω_2, ω_3) depending on three (or more) ω ’s. Even with this severe limitation on our capacity to visualize loss functions acting on high-dimensional spaces, Figure 4-7 gives an accurate picture of how the gradient descent method operates in general. In Figure 4-7, the search happens in the two-dimensional (ω_1, ω_2) plane (the flat ground in Figure 4-7), and we track the progress on the landscape of the function L(ω_1, ω_2) that is embedded in ℝ³. The search space always has one dimension less than the dimension of the space in which the landscape of the loss function is embedded. This makes the optimization process harder, since we are looking for a minimizer of a busy landscape in a flattened or squished version of its terrain (the ground level in Figure 4-7).

Figure 4-7. Two gradient descent steps. Note that if we had started on the other side of the mountain, we would not have converged to the minimum. So where we start, or how we initialize ω, matters when we search for the minimum of a nonconvex function.

Explaining the Role of the Learning Rate Hyperparameter η

At each iteration, the gradient descent method ω_{i+1} = ω_i - η ∇L(ω_i) moves us from the point ω_i in the search space to another point ω_{i+1}. The gradient descent adds -η ∇L(ω_i) to the current ω_i to obtain ω_{i+1}. The quantity -η ∇L(ω_i) is made up of a scalar number η multiplied by the negative of the gradient vector, -∇L(ω_i), which points in the direction of the steepest decrease of the loss function from the point ω_i. Thus, the scaled -η ∇L(ω_i) tells us how far in the search space we are going to go along the steepest descent direction in order to choose the next point ω_{i+1}. In other words, the vector -∇L(ω_i) specifies in which direction we will move away from our current point, and the scalar number η, called the learning rate, controls how far we are going to step along that direction. Figure 4-8 shows one step of gradient descent with two different learning rates η. Too large of a learning rate might overshoot the minimum and cross to the other side of the valley. On the other hand, too small of a learning rate takes a while to get to the minimum. So the trade-off is between choosing a large learning rate and risking overshooting the minimum, and choosing a small learning rate and increasing computational cost and time for convergence.
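
The overshooting trade-off is easy to see on a one-dimensional toy loss L(ω) = ω², whose gradient is 2ω (the starting point and the two learning rates below are values picked for illustration):

```python
# One gradient descent step on L(w) = w**2, starting from w = 1.
def step(w, eta):
    return w - eta * 2 * w   # subtract eta times the gradient 2w

w_small = step(1.0, 0.1)     # 0.8: a careful move toward the minimum at 0
w_large = step(1.0, 1.1)     # -1.2: overshoots the minimum, crosses the valley,
                             # and lands farther from it than where we started
```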

Figure 4-8. One step of gradient descent with two different learning rates. On the left, the learning rate is too large, so gradient descent overshoots the minimum (the starred point) and lands on the other side of the valley. On the right, the learning rate is small, so it takes a while to reach the minimum (the starred point). Note how the gradient vector at a point is perpendicular to the level set through that point.

The learning rate η is another example of a hyperparameter of a machine learning model. It is not one of the weights that goes into the formula of the training function. It is a parameter that is intrinsic to the algorithm that we employ to estimate the weights of the training function.

The scale of the features affects the performance of the gradient descent

This is one of the reasons to standardize the features ahead of time. Standardizing a feature means subtracting from each data instance the mean and dividing by the standard deviation. This forces all the data values to have the same scale, with mean zero and standard deviation one, as opposed to having vastly different scales, such as a feature measured in the millions and another measured in 0.001. But why does this affect the performance of the gradient descent method? Read on.
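
A standardization sketch (the feature values below are made up to mimic the mismatched scales just mentioned):

```python
import numpy as np

# Two features on wildly different scales (values are made up for illustration).
X = np.array([[2_000_000.0, 0.001],
              [3_000_000.0, 0.004],
              [2_500_000.0, 0.002],
              [1_500_000.0, 0.003]])

# Standardize: subtract each column's mean, then divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column of X_std now has mean 0 and standard deviation 1.
```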

Recall that the values of the input features get multiplied by the weights in the training function, and the training function in turn enters into the formula of the loss function. Very different scales of the input features change the shape of the bowl of the loss function, making the minimization process harder. Figure 4-9 shows the level sets of the function L(ω_1, ω_2) = ω_1² + aω_2² with different values of a, mimicking different scales of input features. Note how the level sets of the loss function become much more narrow and elongated as the value of a increases. This means that the shape of the bowl of the loss function is a long, narrow valley.

Figure 4-9. The level sets of the loss function L(ω_1, ω_2) = ω_1² + aω_2² become narrower and more elongated as the value of a increases from 1 to 20 to 40.

When the gradient descent method tries to operate in such a narrow valley, its points hop from one side of the valley to the other, zigzagging as it tries to locate the minimum, and slowing down the convergence considerably. Imagine zigzagging along all the streets of Rome before arriving at the Vatican, as opposed to taking a helicopter straight to the Vatican.

But why does this zigzagging behavior happen? One hallmark of the gradient vector of a function is that it is perpendicular to the level sets of that function. So if the valley of the loss function is so long and narrow, its level sets almost look like lines that are parallel to each other, and with a large enough step size (learning rate), we can literally cross from one side of the valley to the other since it is so narrow. Google gradient descent zigzag and you will see many images illustrating this behavior.
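
We can watch the zigzag happen on the valley-shaped loss L(ω_1, ω_2) = ω_1² + aω_2² from Figure 4-9 (the values of a, η, and the starting point are toy choices for illustration):

```python
import numpy as np

a, eta = 40.0, 0.02
w = np.array([10.0, 1.0])          # start on the side of the long, narrow valley
signs = []
for _ in range(6):
    gradient = np.array([2 * w[0], 2 * a * w[1]])
    w = w - eta * gradient
    signs.append(np.sign(w[1]))

# The w2 coordinate flips sign at every step: the iterates hop from one side
# of the narrow valley to the other instead of heading straight down it.
```

With the same η, the well-scaled case a = 1 produces no sign flips at all, which is the point of standardizing the features first.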

One fix for zigzagging, even with a narrow, long valley (assuming we did not scale the input feature values ahead of time), is to choose a very small learning rate, preventing the gradient descent method from stepping from one side of the valley to the other. However, that slows down the arrival to the minimum in its own way, since the method will step only incrementally at each iteration. We will eventually arrive at the Vatican from Rome, but at a turtle’s pace.

Near the minima (local and/or global), flat regions, or saddle points of the loss function’s landscape, the gradient descent method crawls

The gradient descent method updates the current point ω_i by adding the vector -η ∇L(ω_i). Therefore, the exact length of the step from the point ω_i in the direction of the negative gradient vector is η multiplied by the length of the gradient vector ∇L(ω_i). At a minimum, maximum, saddle point, or any flat region of the landscape of the loss function, the gradient vector is zero, hence its length is zero as well. This means that near a minimum, maximum, saddle point, or any flat region, the step size of the gradient descent method becomes very small, and the method slows down significantly. If this happens near a minimum, then there is not much worry, since this can be used as a stopping criterion, unless this minimum is a local minimum very far from the global minimum. If, on the other hand, it happens in a flat region or near a saddle point, then the method will get stuck there for a while, and that is undesirable. Some practitioners put the learning rate η on a schedule, changing its value during the optimization process. When we look into these schedules, we notice that the goals are to avoid crawling, save computational time, and speed up convergence.
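
A learning rate schedule can be as simple as a step decay (the initial rate, decay factor, and interval below are illustrative values, not a recommendation from the text):

```python
def step_decay(epoch, eta0=0.1, factor=0.5, every=10):
    # Cut the learning rate in half every 10 epochs.
    return eta0 * factor ** (epoch // every)

# eta stays at 0.1 for epochs 0-9, drops to 0.05 for epochs 10-19, and so on,
# taking smaller and smaller steps as we (hopefully) approach the minimum.
```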

We will discuss stochastic (random) gradient descent later in this chapter. Due to the random nature of this method, the points hop around a lot, as opposed to following a more consistent route toward the minimum. This works to our advantage in situations where we are stuck, such as saddle points or local minima, since we might get randomly propelled out of the local minimum or away from the saddle point into a part of the landscape with a better route toward the minimum.

Convex Versus Nonconvex Landscapes

We cannot have an optimization chapter without discussing convexity. In fact, entire mathematical fields are dedicated solely to convex optimization. It is equally important to immediately note that optimization for neural networks is, in general, nonconvex.

When we use nonconvex activation functions, such as the sigmoid-type functions in the first row of Figure 4-5, the landscapes of the loss functions involved in the resulting neural networks are not convex. This is why we spend a good amount of time talking about getting stuck at local minima, flat regions, and saddle points, which we wouldn’t worry about for convex landscapes. The contrast between convex and nonconvex landscapes is obvious in Figure 4-10, which shows a convex loss function and its level sets, and Figures 4-11 and 4-12, which show nonconvex functions and their level sets.

Figure 4-10. Plot of a convex function in three dimensions, together with its level sets. The gradient vectors live in the same space (ℝ²) as the level sets, not in ℝ³.
Figure 4-11. Plot of a nonconvex function in three dimensions, together with its level sets. The gradient vectors live in the same space (ℝ²) as the level sets, not in ℝ³.
Figure 4-12. Plot of another nonconvex function in three dimensions, together with its level sets. The gradient vectors live in the same space (ℝ²) as the level sets, not in ℝ³.

When we use convex activation functions throughout the network, such as the ReLU-type functions in the second row of Figure 4-5, and convex loss functions, we can still end up with a nonconvex optimization problem, because the composition of two convex functions is not necessarily convex. If the loss function happens to be nondecreasing and convex, then its composition with a convex function is convex. The loss functions that are popular for neural networks, such as mean squared error, cross-entropy, and hinge loss, are all convex but not nondecreasing.

It is important to become familiar with central concepts from convex optimization. If you do not know where to start, keep in mind that convexity replaces linearity when linearity is too simplistic or unavailable, then learn everything about the following (which will be tied to AI, deep learning, and reinforcement learning when we discuss operations research in Chapter 10):

  • Max of linear functions is convex

  • Max-min and min-max

  • Saddle points

  • Two-player zero-sum games

  • Duality

Since convex optimization is such a well-developed and understood field (at least more than the mathematical foundations for neural networks), and neural networks still have a long way to go mathematically, it would be nice if we could exploit our knowledge about convexity in order to gain a deeper understanding of neural networks. Research in this area is ongoing. For example, in a 2020 paper titled “Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models,” Tolga Ergen and Mert Pilanci frame the problem of training two-layer ReLU networks as a convex analytic optimization problem. The following is the abstract of the paper:

We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that rectified linear units in neural networks act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one dimensional regression and classification, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. In the more general higher dimensional case, we show that the training problem for two-layer networks can be cast as a convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed form formulas for the optimal neural network weights in certain cases. Our results show that the hidden neurons of a ReLU network can be interpreted as convex autoencoders of the input layer. We also establish a connection to ℓ0-ℓ1 equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.

Stochastic Gradient Descent

So far, training a feed forward neural network has progressed as follows:

  1. Fix an initial set of weights ω 0 for the training function.

  2. Evaluate this training function at all the data points in the training subset.

  3. Calculate the individual losses at all the data points in the training subset by comparing their true labels to the predictions made by the training function.

  4. Do this for all the data in the training subset.

  5. Average all these individual losses. This average is the loss function.

  6. Evaluate the gradient of this loss function at this initial set of weights.

  7. Choose the next set of weights according to the steepest descent rule.

  8. Repeat until you converge somewhere, or stop after a certain number of iterations determined by the performance of the training function on the validation set.

The problem with this process is that when we have a large training subset with thousands of points, and a neural network with thousands of weights, it gets too expensive to evaluate the training function, the loss function, and the gradient of the loss function on all the data points in the training subset. The remedy is to randomize the process: randomly choose a very small portion of the training subset to evaluate the training function, loss function, and gradient of this loss function at each step. This slashes the computational cost dramatically.

Keep repeating this random selection (in principle with replacement but in practice without replacement) of small portions of the training subset until you converge somewhere, or stop after a certain number of iterations determined by the performance of the training function on the validation set. One pass through the whole training subset is called one epoch.

Stochastic gradient descent performs remarkably well, and it has become a staple in training neural networks.
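
A minimal sketch of the mini-batch routine described above, fitting a one-parameter model y ≈ ωx by stochastic gradient descent (the data, batch size, and learning rate are toy choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x plus a little noise (values are illustrative).
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 0.01 * rng.normal(size=200)

w, eta, batch_size = 0.0, 0.1, 20
for epoch in range(50):                        # one epoch = one pass over the data
    order = rng.permutation(len(x))            # shuffle: sample without replacement
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        residual = w * x[idx] - y[idx]         # predictions minus labels on the batch
        grad = 2 * np.mean(residual * x[idx])  # gradient of the batch's mean squared error
        w -= eta * grad

# w is now close to the true slope 3.
```

Each gradient here costs only one small batch instead of the full training subset, which is exactly the computational saving described above.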

Initializing the Weights ω 0 for the Optimization Process

We have already established that initializing with all zero weights or all the same weights is a really bad idea. The next logical step, and what was the traditional practice (before 2010), would be to choose the weights in the initial ω 0 randomly, sampled either from the uniform distribution over small intervals, such as [-1,1], [0,1], or [-0.3,0.3], or from the Gaussian distribution with a preselected mean and variance. Even though this has not been studied in depth, it seems from empirical evidence that it doesn’t matter whether the initial weights are sampled from the uniform distribution or Gaussian distribution, but it does seem that the scale of the initial weights matters when it comes to both the progress of the optimization process and the ability of the network to generalize well to unseen data. It turns out that some choices are better than others in this respect. Currently, the two state-of-the-art choices depend on the choice of the activation function: whether it is sigmoid-type or ReLU-type.

Xavier Glorot initialization

Here, initial weights are sampled from the uniform distribution over the interval [ -√(6/(n+m)) , √(6/(n+m)) ], where n is the number of inputs to the node (e.g., the number of nodes in the previous layer), and m is the number of outputs from the layer (e.g., the number of nodes in the current layer).

Kaiming He initialization

Here, the initial weights are sampled from the Gaussian distribution with zero mean and variance 2/n, where n is the number of inputs to the node.
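
Both recipes are easy to write down. Here is a sketch using NumPy's random generator rather than any deep learning framework; the layer sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_uniform(n_in, n_out):
    # Xavier Glorot: uniform on [-sqrt(6/(n+m)), sqrt(6/(n+m))].
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    # Kaiming He: Gaussian with mean 0 and variance 2/n.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1 = glorot_uniform(100, 50)   # e.g., for a sigmoid-type layer
W2 = he_normal(100, 50)        # e.g., for a ReLU-type layer
```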

Regularization Techniques

Regularization helps us arrive at a good choice for the weights of the training function while at the same time avoiding overfitting the data. We want our trained function to follow the signal in the data rather than the noise, so it can generalize well to unseen data. Here we include four simple yet popular regularization techniques that are used while training a neural network: dropout, early stopping, batch normalization, and weight decay (ridge, lasso, and elastic net) regularizations.

Dropout

Drop some randomly selected neurons from each layer during training. Usually, about 20% of the input layer’s nodes and about half of each of the hidden layers’ nodes are randomly dropped. No nodes from the output layer are dropped. Dropout is partially inspired by genetic reproduction, where half a parent’s genes are dropped and there is a small random mutation. This has the effect of training different networks at once (with a different number of nodes at each layer) and averaging their results, which typically produces more reliable results.

One way to implement dropout is by introducing a hyperparameter p for each layer that specifies the probability at which each node in that layer will be dropped. Recall the basic operations that take place at each node: linearly combine the outputs of the nodes of the previous layer, then activate. With dropout, each output of the nodes of the previous layer (starting with the input layer), is multiplied by a random number r, which can be either 0 or 1 with probability p. Thus when a node’s r takes the value 0, that node is essentially dropped from the network, which now forces the other retained nodes to pick up the slack when adjusting the weights in one gradient descent step. We will explain this further in “Backpropagation in Detail”, and this link provides a step-by-step route to implementing dropout.
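
The masking just described can be sketched as follows. Note that this uses the common "inverted dropout" scaling by 1/(1-p), which keeps the expected output unchanged; that scaling detail is an addition here, not from the text:

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(layer_output, p):
    # Each node's output is dropped (zeroed) with probability p; the survivors
    # are scaled by 1/(1-p) so the expected value of the layer is unchanged.
    r = rng.random(layer_output.shape) >= p   # r = 1 keeps the node, r = 0 drops it
    return layer_output * r / (1.0 - p)

h = np.ones(10_000)          # pretend outputs of a large hidden layer
h_dropped = dropout(h, 0.5)  # roughly half the nodes are zeroed out
```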

For a deeper mathematical exploration, a 2015 paper connects dropout to Bayesian approximations of model uncertainty.

Early Stopping

As we update the weights during training, in particular during gradient descent, after each epoch, we evaluate the error made by the training function at the current weights on the validation subset of the data.

This error should be decreasing as the model learns the training data; however, after a certain number of epochs, this error will start increasing, indicating that the training function has started overfitting the training data and is failing to generalize well to the validation data. Once we observe this increase in the model’s error over the validation subset, we stop training and go back to the set of weights where that error was lowest, right before we started observing the increase.
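A sketch of the stopping logic itself, with the validation errors simulated rather than produced by a real training loop; the function name and the patience rule (stop after two epochs without improvement) are illustrative choices, not prescriptions from the text.

```python
def early_stopping(errors, patience=2):
    """Return (best_epoch, best_error): stop once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
            bad_epochs = 0          # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break               # overfitting has set in; stop
    return best_epoch, best_error

# Simulated validation errors: decreasing at first, then overfitting.
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.42, 0.50]
best_epoch, best_error = early_stopping(errors)
# stops during epoch 5 and returns (3, 0.35): the weights saved at
# epoch 3, where the validation error was lowest, are the ones we keep
```

In practice this means checkpointing the weights at every improvement, so the epoch-3 weights are still available when training stops.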

Batch Normalization of Each Layer

The main idea here is to normalize the inputs to each layer of the network. This means that the inputs to each layer will have mean 0 and variance 1. This is usually accomplished by subtracting the mean and dividing by the variance for each of the layer’s inputs. We will detail this in a moment. The reason this is good to do at each hidden layer is similar to why it is good at the original input layer.

Applying batch normalization often eliminates the need for dropout, and allows us to be less particular about initialization. It makes the training faster and safer from vanishing and exploding gradients. It also has the added advantage of regularization. The cost for all of these gains is not too high, as it usually involves training only two additional parameters, one for scaling and one for shifting, at each layer.

The 2015 paper by Ioffe and Szegedy introduced the method. The abstract of their paper describes the batch normalization process and the problems it addresses (the brackets are my own comments):

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities [such as the sigmoid type activation functions, in Figure 4-5, which become almost constant, outputting the same value when the input is large in magnitude. This renders the nonlinearity useless in the training process, and the network stops learning at subsequent layers]. We refer to this phenomenon [the change in the distribution of the inputs to each layer] as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

Batch normalization is often implemented in the architecture of a network either in its own layer before the activation step, or after activation. The process, during training, usually follows these steps:

  1. Choose a batch of size b from the training data. Each data point in the batch has a feature vector x_i, so the whole batch has feature vectors x_1, x_2, …, x_b.

  2. Calculate the vector whose entries are the means of each feature in this particular batch: μ = (x_1 + x_2 + ⋯ + x_b)/b.

  3. Calculate the variance across the batch: subtract μ from each of x_1, x_2, …, x_b, square the entries of each result, add, and divide by b.

  4. Normalize each of x_1, x_2, …, x_b by subtracting the mean and dividing by the square root of the variance.

  5. Scale and shift by trainable parameters that can be initialized and learned by gradient descent, the same way the weights of the training function are learned. This becomes the input to the first hidden layer.

  6. Do the same for the input of each of the subsequent layers.

  7. Repeat for the next batch.
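The steps above can be sketched in NumPy. This is a minimal illustration, not the full training-time bookkeeping (a real implementation also tracks running means and variances to reuse at test time); the small constant eps added under the square root is a common implementation detail for numerical safety, not part of the recipe above.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    X has shape (b, d): b points in the batch, d features. gamma and
    beta are the trainable per-feature scale and shift parameters."""
    mu = X.mean(axis=0)                 # per-feature batch means
    var = ((X - mu) ** 2).mean(axis=0)  # per-feature batch variances
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a batch of 64 points
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1 and beta = 0, each output feature has mean ≈ 0 and
# variance ≈ 1 over the batch.
```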

During testing and prediction, there is no batch of data to train on, and the parameters at each layer are already learned. The batch normalization step, however, is already incorporated into the formula of the training function. During training, we were changing these per batch of training data. This in turn was changing the formula of the loss function slightly per batch. However, the point of normalization was partly not to change the formula of the loss function too much, because that in turn would change the locations of its minima, and that would cause us to forever chase a moving target. Alright, we fixed that with batch normalization during training, and now we want to validate, test, and predict. Which mean vector and variance do we use for a particular data point that we are testing/predicting at? Do we use the means and variances of the features of the original data set? We have to make such decisions.

Control the Size of the Weights by Penalizing Their Norm

Another way to regularize the training function to avoid overfitting the data is to introduce a competing term into the minimization problem. Instead of solving for the set of weights ω that minimizes only the loss function:

min_ω L(ω)

we introduce a new term α ω and solve for the set of weights ω that minimizes:

min_ω L(ω) + α‖ω‖

For example, for the mean squared error loss function usually used for regression problems, the minimization problem looks like:

min_ω (1/m) Σ_{i=1}^m |y_predict,i(ω) − y_true,i|² + α‖ω‖

Recall that so far we have established two ways to solve that minimization problem:

The minimum happens at points where the derivative (gradient) is equal to zero

So the minimizing ω must satisfy ∇L(ω) + α∇‖ω‖ = 0. Then we solve this equation for ω if we have the luxury of getting a closed form for the solution. In the case of linear regression (which we can think of as an extremely simplified neural network, with only one layer and no nonlinear activation function), we do have this luxury, and for this regularized case, the formula for the minimizing ω is:

ω = (X^t X + αB)^{-1} X^t y_true

where the columns of X are the feature columns of the data augmented with a vector of ones, and B is the identity matrix (if we use ridge regression, discussed later). The closed form solution for the extremely simple linear regression problem with regularization helps us appreciate weight decay type regularization and see the important role it plays. Instead of inverting the matrix X^t X in the unregularized solution and worrying about its ill-conditioning (for example, from highly correlated input features) and the resulting instabilities, we invert X^t X + αB in the regularized solution. Adding this αB term is equivalent to adding a small positive term to the denominator of a scalar number that helps us avoid division by zero. Instead of using 1/x where x runs the risk of being zero, we use 1/(x + α) where α is a positive constant. Recall that matrix inversion is the analog of scalar number division.
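A quick numerical illustration of this closed form, on synthetic data whose two feature columns are nearly identical so that X^t X by itself is ill-conditioned; the data and the value of α are made up for the demonstration.

```python
import numpy as np

# Synthetic regression data with two almost-collinear features.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)        # nearly a copy of x1
X = np.column_stack([np.ones(200), x1, x2])  # augment with a column of ones
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.1, size=200)

# Regularized closed form: ω = (X^t X + αB)^{-1} X^t y_true, B = identity.
alpha = 0.1
B = np.eye(X.shape[1])
omega = np.linalg.solve(X.T @ X + alpha * B, X.T @ y)
# omega[0] ≈ 3 and omega[1] + omega[2] ≈ 2: the ridge term splits the
# weight between the two collinear columns instead of letting the
# ill-conditioned system blow the coefficients up.
```

Note the use of `np.linalg.solve` rather than forming the inverse explicitly; solving the linear system is cheaper and numerically safer.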

Gradient descent

We use gradient descent or any of its variations, such as stochastic gradient descent, when we do not have the luxury of obtaining a closed form solution for the derivative-equals-zero equation, and when our problem is so large that computing second-order derivatives is extremely expensive.

Commonly used weight decay regularizations

There are three popular regularizations that control the size of the weights that we are forever searching for in this book:

Ridge regression

Penalize the l2 norm of ω. In this case, we add the term α Σ_{i=1}^n |ω_i|² to the loss function, then we minimize.

Lasso regression

Penalize the l1 norm of the ω’s. In this case, we add the term α Σ_{i=1}^n |ω_i| to the loss function, then we minimize.

Elastic net

This is a middle-ground case between ridge and lasso regressions. We introduce one additional hyperparameter γ, which can take any value between zero and one, and add a term to the loss function that combines both ridge and lasso regressions through γ: γα Σ_{i=1}^n |ω_i|² + (1 − γ)α Σ_{i=1}^n |ω_i|. When γ = 0, this becomes lasso regression; when γ = 1, it is ridge regression; and when it is between zero and one, it is some sort of middle ground.
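The three penalty terms can be written down directly. The helper below is a hypothetical sketch (the function name and interface are made up), with the elastic net term matching the γ-combination given above.

```python
import numpy as np

def weight_decay_penalty(omega, alpha, kind="ridge", gamma=0.5):
    """Penalty term added to the loss before minimizing.

    'elastic' mixes ridge and lasso through gamma, matching
    γ α Σ|ω_i|² + (1 − γ) α Σ|ω_i| from the text."""
    l1 = np.sum(np.abs(omega))   # Σ |ω_i|
    l2 = np.sum(omega ** 2)      # Σ |ω_i|²
    if kind == "ridge":
        return alpha * l2
    if kind == "lasso":
        return alpha * l1
    if kind == "elastic":
        return gamma * alpha * l2 + (1 - gamma) * alpha * l1
    raise ValueError(kind)

omega = np.array([1.0, -2.0, 0.5])   # Σ|ω_i| = 3.5, Σ|ω_i|² = 5.25
p_ridge = weight_decay_penalty(omega, alpha=0.1, kind="ridge")
p_lasso = weight_decay_penalty(omega, alpha=0.1, kind="lasso")
p_elastic = weight_decay_penalty(omega, alpha=0.1, kind="elastic", gamma=0.0)
# gamma = 0 recovers the lasso penalty, gamma = 1 the ridge penalty
```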

When do we use plain linear regression, ridge, lasso, or elastic net?

If you are already confused and slightly overwhelmed by the multitude of choices that are available for building machine learning models, join the club, but do not get frustrated. Until the mathematical analysis that tells us exactly which choices are better than others and under what circumstances becomes available (or catches up with mathematical computation and experimentation), think about the enormity of available choices the same way you think about a home renovation: we have to choose from many available materials, designs, and architectures to produce a final product. This is a home renovation, not a home decoration, so our decisions are fateful and more consequential than mere decoration choices. They do affect the quality and the function of our final product, but they are choices nevertheless. Rest easy, there is more than one way to skin AI:

  • Some regularization is always good. Adding a term that controls the sizes of the weights and competes with minimizing the loss function is good in general.

  • Ridge regression is usually a good choice because the l 2 norm is differentiable. Minimizing this is more stable than minimizing the l 1 norm.

  • If we decide to go with the l1 norm, even though it is not differentiable at 0, we can still define its subdifferential or subgradient at 0. For example, we can set that to be zero. Note that f(x) = |x| is differentiable when x ≠ 0: it has derivative 1 when x > 0 and −1 when x < 0; the only problematic point is x = 0.

  • If we suspect only a few features are useful, then it is good to use either lasso or elastic net as a data preprocessing step to kill off the less important features.

  • Elastic net is usually preferred over lasso because lasso might behave badly when the number of features is greater than the number of training instances or when several features are strongly correlated.

Penalizing the l 2 Norm Versus Penalizing the l 1 Norm

Our goal is to find ω that solves the minimization:

min_{ω, ω_0} L(ω, ω_0) + α‖ω‖

The first term wants to decrease the loss L ( ω , ω 0 ) . The other term wants to decrease the values of the coordinates of ω all the way to zeros. The type of the norm that we choose for ω determines the path ω follows on its way to 0 .

If we use the l 1 norm, the coordinates of ω will decrease; however, a lot of them might encounter premature death, hitting zero before others. That is, the l 1 norm encourages sparsity: when a weight dies, it kills the contribution of the associated feature to the training function.

The plot on the right in Figure 4-13 shows the diamond-shaped level sets of ‖ω‖_{l1} = |ω_1| + |ω_2| in two dimensions (if we only had two features), namely, |ω_1| + |ω_2| = c for various values of c. If a minimization algorithm follows the path of steepest descent, such as gradient descent, then we must travel in the direction perpendicular to the level sets, and as the arrow shows in the plot, ω_2 becomes zero pretty fast, since going perpendicular to the diamond-shaped level sets is bound to hit one of the coordinate axes, effectively killing the respective feature. ω_1 then travels to zero along the horizontal axis.

Figure 4-13. The plot on the left shows the circular level sets of the l2 norm of ω, with the gradient descent direction pointing toward the minimum at (0, 0). The plot on the right shows the diamond-shaped level sets of the l1 norm of ω, with the gradient descent direction pointing toward the minimum at (0, 0).

  • If we use the l2 norm, the weight sizes get smaller without necessarily killing them. The plot on the left in Figure 4-13 shows the circular-shaped level sets of ‖ω‖_{l2} = √(ω_1² + ω_2²) in two dimensions, namely, ω_1² + ω_2² = c for various values of c. We see that following the path perpendicular to the circular level sets toward the minimum at (0, 0) decreases the values of both ω_1 and ω_2 without either of them becoming zero before the other.

Which norm to choose depends on our use cases. Note that in all cases, we do not regularize the bias weights ω 0 . This is why in this section we wrote them separately in the loss function L ( ω , ω 0 ) .

Explaining the Role of the Regularization Hyperparameter α

The minimization problem with weight decay regularization looks like:

min_ω L(ω) + α‖ω‖

To understand the role of the regularization hyperparameter α , we observe the following:

  • There is a competition between the first term, where the loss function L ( ω ) chooses ω ’s that fit the training function to the training data, and the second term that just cares about making the ω values small. These two objectives are not necessarily in sync. The values of ω ’s that make the first term smaller might make the second term bigger and vice versa.

  • If α is big, then the minimization process will compensate by making the values of the ω’s very small, regardless of whether these small values of ω make the first term small as well. So the more we increase α, the more important minimizing the second term becomes relative to the first, and our ultimate model might end up not fitting the data perfectly (high bias), but this is sometimes desired (low variance) so that it generalizes well to unseen data.

  • If, on the other hand, α is small (say, close to zero), then we can choose larger ω values, and minimizing the first term becomes more important. Here, the minimization process will result in ω values that make the first term happy, so the data will fit into the model nicely (low bias) but the variance might be high. In this case, our model would work well on seen data (it is designed to fit it nicely through minimizing L ( ω ) ), but might not generalize well to unseen data.

  • As α 0 , we can prove mathematically that the solution of the regularized problem converges to the solution of the unregularized problem.
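For ridge regression, the behavior described in these bullets can be observed directly through the closed form ω(α) = (X^t X + αB)^{-1} X^t y_true: as α grows, the norm of ω shrinks, and as α → 0, the solution approaches the unregularized one. The data below is synthetic, made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(alpha):
    # closed-form ridge solution (X^t X + alpha I)^{-1} X^t y
    return np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

norms = [np.linalg.norm(ridge(a)) for a in (0.0, 1.0, 10.0, 100.0)]
# norms is strictly decreasing: larger alpha shrinks the weights, and
# ridge(1e-8) is essentially the unregularized least-squares solution.
```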

Hyperparameter Examples That Appear in Machine Learning

We have now encountered many hyperparameters that enter machine learning models. It is good practice to list the ones that enter our particular model along with their values. Let’s list the ones we have come across and recall that tuning these enhances the performance of our models. Most of the time, there are recommended values for us to use. These are usually implemented as default values in machine learning libraries and software packages. However, it is always good to experiment with different values during the validation stage of our modeling process, given that we have the available time and resources. The hyperparameters include:

  • The learning rate in gradient descent.

  • Weight decay coefficients, such as the ones that appear in ridge, lasso, and elastic net regularizations.

  • The number of epochs before we stop training.

  • The sizes of data split into training, validation, and testing subsets.

  • The sizes of mini-batches during stochastic gradient descent and its variants.

  • The acceleration coefficients in momentum methods.

  • The architecture of a neural network: number of layers, number of neurons in each layer, what happens at each layer (batch normalization, type of activation function), type of regularization (dropout, ridge, lasso), type of network (feed forward, dense, convolutional, adversarial, recurrent), type of loss functions, etc.

Chain Rule and Backpropagation: Calculating ∇L(ω_i)

It is time to get our hands dirty and compute something important: the gradient of the loss function, namely, ∇L(ω_i). Whether we decide to find our optimal weights using gradient descent, stochastic gradient descent, mini-batch gradient descent, or any other variant of gradient descent, there is no escape from calculating this quantity. Recall that the loss function includes in its formula the neural network’s training function, which in turn is made up of successive linear combinations and compositions with activation functions. This means that we have to cleverly use the chain rule. Back in calculus, we only used the single variable chain rule for derivatives, but now we somehow have to transition to a chain rule of several variables: several, as in, sometimes billions.

It is the layered architecture of a neural network that forces us to pause and think: how exactly are we going to compute this one derivative of the loss function? The workhorse here is the backpropagation algorithm (also called backward mode automatic differentiation), and it is a powerful one.

Before writing formulas, let’s summarize the steps that we follow as we train a neural network:

  • The training function is a function of ω, so the outcome of the neural network after a data point passes through it, which is the same as evaluating the training function at the data point, is: outcome = function(ω). This is made up of linear combinations of node outputs, followed by compositions with activation functions, repeated over all of the network’s layers. The output layer might or might not have an activation function and could have one node or multiple nodes, depending on the ultimate task of the network.

  • The loss function provides a measure of how badly the outcome of the training function diverged from what is true.

  • We initialize our learning function with a random set of weights ω 0 , according to preferred initialization rules prescribed in the previous sections. Then we compute the loss, or error, that we committed because of using these particular weight values. This is the forward pass of the data point through the net.

  • We want to move to the next set of weights ω 1 that gives a lower error. We move in the direction opposite to the gradient vector of the loss function.

  • But: the training function is built into the loss function, and given the layered structure of this function, which comes from the architecture of the neural network, along with its high dimensionality, how do we efficiently perform the multivariable chain rule to find the gradient and evaluate it at the current set of weights?

The answer is that we send the data point back through the network, computing the gradient backward from the output layer all the way to the input layer, evaluating along the way how each node contributed to the error. In essence, we compute ∂L/∂(node functions), then we tweak the weights accordingly, updating them from ω_0 to ω_1. The process continues as we pass more data points into the network, usually in batches. One epoch is then counted each time the network has seen the full training set.

Backpropagation Is Not Too Different from How Our Brain Learns

When we encounter a new math concept, the neurons in our brain make certain connections. The next time we see the same concept, the same neurons connect better. The analogy for our neural network is that the value ω of the edge connecting the neurons increases. When we see the same concept again and again, it becomes part of our brain’s model. This model will not change, unless we learn new information that undoes the previous information. In that case, the connection between the neurons weakens. For our neural network, the ω value connecting the neurons decreases. Tweaking the ω ’s via minimizing the loss function accomplishes exactly that: establishing the correct connections between the neurons.

The neuroscientist Donald Hebb mentions in his 1949 book The Organization of Behavior: A Neuropsychological Theory (paraphrased): When a biological neuron triggers another neuron often, the connection between these two neurons grows stronger. In other words, cells that fire together, wire together.

Similarly, a neural network’s computational model takes into account the error made by the network when it produces an outcome. Since computers only understand numbers, the ω of an edge increases if the node contributes to lowering the error, and decreases if the node contributes to increasing the error function. So a neural network’s learning rule reinforces the connections that reduce the error by increasing the corresponding ω ’s, and weakens the connections that increase the error by decreasing the corresponding ω ’s.

Why Is It Better to Backpropagate?

Backpropagation computes the derivative of the training function with respect to each node, moving backward through the network. This measures the contribution of each node to both the training function and the loss function L ( ω ) .

The most important formula to recall here is: the chain rule from calculus. This calculates the derivatives of chained functions (or function compositions). The calculus chain rule mostly deals with functions depending only on one variable ω ; for example, for three chained functions, the derivative with respect to ω is:

(d/dω) f_3(f_2(f_1(ω))) = {(d/dω) f_1(ω)} · {(d/df_1) f_2(f_1(ω))} · {(d/df_2) f_3(f_2(f_1(ω)))}

For neural networks, we must apply the chain rule to the loss function that depends on the matrices and vectors of variables W and ω 0 . So we have to generalize the above rule to a many variables chain rule. The easiest way to do this is to follow the structure of the network computing the derivatives backward, from the outcome layer all the way back to the input layer.
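The single-variable chain rule above is easy to check numerically before trusting it at network scale. The three chained functions below (f_1 = sin, f_2 = exp, f_3 = squaring) are arbitrary illustrative choices; the product of the three derivative factors matches a finite-difference estimate.

```python
import numpy as np

def f(w):
    # f3(f2(f1(w))) with f1 = sin, f2 = exp, f3 = squaring
    return np.exp(np.sin(w)) ** 2

def f_prime(w):
    # chain rule: f1'(w) * f2'(f1(w)) * f3'(f2(f1(w)))
    f1 = np.sin(w)
    f2 = np.exp(f1)
    return np.cos(w) * f2 * (2 * f2)

w, h = 0.7, 1e-6
numeric = (f(w + h) - f(w - h)) / (2 * h)   # centered finite difference
# numeric agrees with f_prime(w) to many decimal places
```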

If instead we decide to compute the derivatives forward through the network, we would not know whether these derivatives with respect to each variable will ultimately contribute to our final outcome, because we do not know if they will connect through the graph of the network. Even when the graph is fully connected, the weights for deeper layers are not present in earlier layers, so it is a big waste to compute for their derivatives in the early layers.

When we compute the derivatives backward through the network, we start with the output and follow the edges of the graph of the network back, computing the derivatives at each node. Each node’s contribution is calculated only from the edges leading to it and edges going out of it. This is computationally much cheaper because now we are sure of how and when these nodes contribute to the network’s outcome.

In linear algebra, it is much cheaper to compute the multiplication of a matrix with a vector than to compute the multiplication of two matrices together. We must always avoid multiplying two matrices with each other: computing A ( B 𝐯 ) is cheaper than computing ( A B ) 𝐯 , even though in theory, these two are exactly the same. Over large matrices and vectors, this simple observation provides enormous cost savings.

Backpropagation in Detail

Let’s pause and be thankful that software packages exist so that we never have to implement the following computation ourselves. Let’s also not forget to be grateful to the creators of these software packages. Now we compute.

For a neural network with h hidden layers, we can write the loss function as a function of the training function, which in turn is a function of all the weights that appear in the network:

L = L(g(W^1, ω_0^1, W^2, ω_0^2, …, W^h, ω_0^h, W^{h+1}, ω_0^{h+1}))

We will compute the partial derivatives of L backward, starting with ∂L/∂W^{h+1} and ∂L/∂ω_0^{h+1}, and working our way back to ∂L/∂W^1 and ∂L/∂ω_0^1. The derivatives are taken with respect to each entry in the corresponding matrix or vector.

Suppose for simplicity, but without loss of generality, that the network is a regression network predicting a single numerical value, so that the training function g is scalar (not a vector). Suppose also that we use the same activation function f for each neuron throughout the network. The output neuron has no activation since this is a regression.

Let’s find the derivatives with respect to the weights pointing to the output layer. The loss function is:

L = L H+1 s H + ω 0 H+1

so that

∂L/∂ω_0^{h+1} = 1 × L′(W^{h+1} s^h + ω_0^{h+1})

and

L H+1 = s H t L ' H+1 s H + ω 0 H+1

Recall that s^h is the output of the last hidden layer, so it depends on all the weights of the previous layers, namely, (W^1, ω_0^1, W^2, ω_0^2, …, W^h, ω_0^h).

To compute derivatives with respect to the weights pointing to the last hidden layer, we show them explicitly in the formula of the loss function:

$L = L\left(W^{h+1} f\left(W^h s^{h-1} + \omega_0^h\right) + \omega_0^{h+1}\right)$

so that

$\frac{\partial L}{\partial \omega_0^h} = 1 \times W^{h+1} f'\left(W^h s^{h-1} + \omega_0^h\right) \, L'\left(W^{h+1} f\left(W^h s^{h-1} + \omega_0^h\right) + \omega_0^{h+1}\right)$

and

$\frac{\partial L}{\partial W^h} = s^{h-1} \, W^{h+1} f'\left(W^h s^{h-1} + \omega_0^h\right) \, L'\left(W^{h+1} f\left(W^h s^{h-1} + \omega_0^h\right) + \omega_0^{h+1}\right)$

Recall that $s^{h-1}$ is the output of the hidden layer before the last hidden layer, so it depends on all the weights of the previous layers, namely, $(W^1, \omega_0^1, W^2, \omega_0^2, \ldots, W^{h-1}, \omega_0^{h-1})$.

We continue the process systematically until we reach the input layer.
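
To make this backward pass concrete, here is a minimal NumPy sketch for a one-hidden-layer regression network with a tanh activation. The shapes, data, and variable names are illustrative, not from any library; the chain-rule gradients are checked against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny regression network: 4 input features, one hidden layer of
# 3 tanh neurons, and a scalar output with no activation, as in the text.
x = rng.standard_normal(4)           # one input sample
y_true = 1.5                         # its target value
W1, w01 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, w02 = rng.standard_normal(3), rng.standard_normal()

f = np.tanh                          # the same activation at every neuron
f_prime = lambda z: 1 - np.tanh(z) ** 2

def loss(W1, w01, W2, w02):
    s1 = f(W1 @ x + w01)             # output of the hidden layer
    g = W2 @ s1 + w02                # scalar training function
    return 0.5 * (g - y_true) ** 2   # squared loss

# Backward pass: the chain rule, starting from the output layer.
z1 = W1 @ x + w01
s1 = f(z1)
g = W2 @ s1 + w02
dL_dg = g - y_true                   # L'(.) for the squared loss
dL_dw02 = dL_dg * 1.0
dL_dW2 = dL_dg * s1
dL_dw01 = dL_dg * W2 * f_prime(z1)   # one layer back
dL_dW1 = np.outer(dL_dw01, x)

# Check one entry against a central finite difference.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (loss(Wp, w01, W2, w02) - loss(Wm, w01, W2, w02)) / (2 * eps)
print(abs(fd - dL_dW1[0, 0]) < 1e-5)
```

The same pattern extends layer by layer: each step back multiplies by the local derivative f' and by the weights of the layer in front of it.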

Assessing the Significance of the Input Data Features

One goal of data analysts is to assess the significance of the input variables (data features) with respect to the output or target variable.

The main question to answer here is: if we tweak the value of a certain input variable, what is the relative change of the output?

For example, if we add one more bus on a given bus route, would that affect the overall bus ridership?

The math question that we are asking is a derivative question: find the partial derivative of the output with respect to the input variable in question.

We have plenty of literature in statistics on variable significance when the models are linear (sensitivity analysis). When the models are nonlinear, such as our neural network models, there isn’t as much literature. We cannot make our predictions based on nonlinear models, then employ variable significance analysis that is built for linear models. Many data analysts who use built-in software packages for their analysis fall into this trap. This is another reason to seek to understand deeply the assumptions of the models on which we base our business decisions.
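
One model-agnostic way to probe sensitivity, sketched below with a made-up nonlinear model (the weights and operating point are invented for illustration), is to estimate each partial derivative of the output with respect to an input feature by a finite difference at the operating point we care about. For a nonlinear model the answer depends on that point, which is exactly why significance analysis built for linear models does not transfer.

```python
import numpy as np

# A hypothetical trained nonlinear model: say, ridership as a function
# of three features. The weights are made up for illustration.
w = np.array([0.8, -0.1, 2.0])

def model(x):
    return 100 * np.tanh(w @ x / 10)

x0 = np.array([5.0, 3.0, 1.0])       # the operating point we care about

# Sensitivity of the output to each input = partial derivative,
# estimated with a central finite difference.
eps = 1e-5
sens = np.zeros(3)
for i in range(3):
    xp, xm = x0.copy(), x0.copy()
    xp[i] += eps
    xm[i] -= eps
    sens[i] = (model(xp) - model(xm)) / (2 * eps)

print(np.argsort(-np.abs(sens)))     # features ranked by influence at x0
```

Unlike a linear model, repeating this at a different x0 can change both the magnitudes and the ranking of the sensitivities.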

Summary and Looking Ahead

This chapter represents our official transition to the deep learning era in the AI field. While Chapter 3 presented traditional yet still very useful machine learning models, Chapter 4 adds neural networks to our arsenal of machine learning models. Both chapters built the models with the same general mathematical structure of training function, loss function, and optimization, where each component was tailored to the particular task and model at hand.

By employing nonlinear activation functions at each neuron of a neural network, over multiple layers, our training function is able to pick up on complex features in the data that are otherwise hard to describe using an explicit formula of a nonlinear function. Mathematical analysis—in particular, universal approximation theorems for neural networks—back up this intuition and provide a theoretical background that justifies the wild success of neural networks. These theorems, however, still lack the ability to provide us with a map to construct special networks tailored to specific tasks and data sets, so we must experiment with various architectures, regularizations, and hyperparameters until we obtain a neural network model that performs well on new and unseen data.

Neural networks are well tailored to large problems with large data sets. The optimization task for such large problems requires efficient and computationally inexpensive methods, though all computations at that scale can be considered expensive. Stochastic gradient descent is the popular optimization method of choice, and the backpropagation algorithm is the workhorse of this method. More specifically, the backpropagation algorithm computes the gradient of the loss function (or the objective function when we add weight decay regularization) at the current weight choice. Understanding the landscape of the objective function remains central for any optimization task, and as a rule of thumb, convex problems are easier to optimize than nonconvex ones. Loss functions involved in neural network models are generally nonconvex.

Chapter 4 is the last foundational (and long) chapter in this book. We can finally discuss more specialized AI models, as well as deeper mathematics, when needed. The next chapters are independent from each other, so read them in the order that feels most relevant to your immediate application area.

Finally, let’s summarize the mathematics that appeared in this chapter, which we must elaborate more on as we progress in the field:

Probability and measure

This is needed to prove universal approximation-type theorems, and will be discussed in Chapter 11. It is also related to uncertainty analysis for dropout.

Statistics

Input standardizing steps during batch normalization at each layer of the neural network, and the resulting reshaping of the related distributions.

Optimization

Gradient descent, stochastic gradient descent, and convex and nonconvex landscapes.

Calculus on linear algebra

Backpropagation algorithm: this is the chain rule from calculus applied to functions of matrices of variables.

Chapter 5. Convolutional Neural Networks and Computer Vision

They. Can. See.

H.

Convolutional neural networks have revolutionized the computer vision and the natural language processing fields. Application areas, irrespective of the ethical questions associated with them (such as surveillance, automated weapons, etc.), are limitless: self-driving cars, smart drones, facial recognition, speech recognition, medical imaging, generating audio, generating images, robotics, etc.

In this chapter, we start with the simple definitions and interpretations of convolution and cross-correlation, and highlight the fact that these two slightly different mathematical operations are conflated in machine learning terminology. We perpetrate the same sin and conflate them as well, but with a good reason.

We then apply the convolution operation to filtering grid-like signals, which it is perfectly suited for, such as time series data (one-dimensional), audio data (one-dimensional), and images (two-dimensional if the images are grayscale, and three-dimensional if they are color images, with the extra dimension corresponding to the red, green, and blue channels). When data is one-dimensional, we use one-dimensional convolutions, and when it is two-dimensional, we use two-dimensional convolutions (for the sake of simplicity and conciseness, we will not do three-dimensional convolutions in this chapter, corresponding to three-dimensional color images, called tensors). In other words, we adapt our network to the shape of the data, a process that has wildly contributed to the success of convolutional neural networks. This is in contrast to forcing the data to adapt to the shape of the network’s input, such as flattening a two-dimensional image into one long vector to make it fit a network that only takes one-dimensional data as its input. In later chapters, we see that the same applies for the success of graph neural networks for graph-type data.
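
As a small illustration of a two-dimensional convolution, implemented with plain loops in cross-correlation style on an invented 6 by 6 grayscale image, a vertical-edge filter responds only near the edge between the dark and bright halves:

```python
import numpy as np

# A tiny 6x6 grayscale "image": dark left half, bright right half.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A 3x3 vertical-edge kernel (Sobel-like; the exact values are
# illustrative).
k = np.array([[1.0, 0.0, -1.0],
              [2.0, 0.0, -2.0],
              [1.0, 0.0, -1.0]])

# "Valid" 2-D cross-correlation with plain loops: slide the kernel
# over every 3x3 patch and take the weighted sum.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)

print(out)   # nonzero only in the columns straddling the edge
```

The same loop with a one-dimensional kernel over a one-dimensional signal is exactly the audio or time series case.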

Next, we incorporate convolution into the architecture of a feed forward neural network. The convolution operation has the effect of making the network locally connected as opposed to fully connected. Mathematically, the matrix containing the weights for each layer is not dense (so most of the weights for convolutional layers are zero). Moreover, the weights have similar values (weight sharing), unlike the case of fully connected neural networks, where a different weight is assigned to each input. Therefore, the matrix containing the weights of a convolutional layer is mostly zeros, and the nonzero parts are localized and share similar values. For image data or audio data, this is great, since most of the information is contained locally. Moreover, this dramatically reduces the number of weights that we need to store and compute during the optimization step, rendering convolutional neural networks ideal for data with a massive amount of input features (recall that for images, each pixel is a feature).
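
We can see this sparsity directly: a one-dimensional "valid" convolutional layer is a matrix-vector product with a banded matrix whose rows all carry the same kernel weights. The kernel and sizes below are arbitrary.

```python
import numpy as np

# A 1-D convolutional layer is a linear map whose weight matrix is
# sparse and banded, with the same kernel entries repeated on each row
# (weight sharing). Build that matrix explicitly and compare it to a
# direct "valid" cross-correlation.
k = np.array([1.0, -2.0, 1.0])       # a 3-tap kernel
n = 8                                # input length
m = n - len(k) + 1                   # output length in "valid" mode

W = np.zeros((m, n))
for i in range(m):
    W[i, i:i + len(k)] = k           # shared weights on each row

x = np.arange(n, dtype=float)
out_matrix = W @ x
out_direct = np.array([x[i:i + len(k)] @ k for i in range(m)])
print(np.allclose(out_matrix, out_direct))
```

Only m times 3 of the m times n entries of W are nonzero, and the layer has just 3 free parameters instead of m times n, which is the dramatic saving the text describes.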

We then discuss pooling, which is another layer that is common to the architecture of convolutional neural networks. As in the previous chapter, both the multilayer structure and the nonlinearity at each layer enable us to extract increasingly complex features from images, significantly enhancing computer vision tasks.

Once we comprehend the basic anatomy of a convolutional neural network, it is straightforward to apply the same mathematics to tasks involving natural language processing, such as sentiment analysis, speech recognition, audio generation, and others. The fact that the same mathematics works for both computer vision and natural language processing is akin to the impressive ability of our brain to physically change in response to circumstances, experiences, and thoughts (the brain’s version of virtual simulations). Even when some portions of the brain are damaged, other parts can take over and perform new functions. For example, the parts of the brain devoted to sight can start performing hearing or remembering tasks when those are impaired. In neuroscience, this is called neuroplasticity. We are very far from having a complete understanding of the brain and how it works, but the simplest explanation to this phenomenon is that each neuron in the brain performs the same basic function, similar to how neurons in a neural network perform one basic mathematical calculation (actually two: linearly combine, then activate), and the various neural connections over multiple layers produce the observed complexity in perception and behavior.

Convolutional neural networks are in fact inspired by the brain’s visual neocortex. The success in 2012 for image classification (AlexNet2012) propelled AI back into the mainstream, inspiring many and bringing us here. If you happen to have some extra time, a good bedtime read accompanying this chapter would be about the function of the brain’s visual neocortex, and its analogy to convolutional neural networks designed for computer vision.

Convolution and Cross-Correlation

Convolution and cross-correlation are slightly different operations and measure different things in a signal, which can be a digital image, a digital audio signal, or others. They are exactly the same if we use a symmetric function k, called a filter or a kernel. Briefly speaking, convolution flips the filter, then slides it across the function, while cross-correlation slides the filter across the function without flipping it. Naturally, if the filter happens to be symmetric, then convolution and cross-correlation are exactly the same. Flipping the kernel has the advantage of making the convolution operation commutative, which in turn is good for writing theoretical proofs. That said, from a neural networks perspective, commutativity is not important, for three reasons:

  • First, the convolution operation does not usually appear alone in a neural network, but is composed with other nonlinear functions, so we lose commutativity irrespective of whether we flip the kernel or not.

  • Second, a neural network usually learns the values of the entries in the kernel during the training process; thus, it learns the correct values in the correct locations, and flipping becomes immaterial.

  • Third, and this is important for practical implementation, convolutional networks usually use multichannel convolution; for example, the input can be a color image with red-green-blue channels, or even a video, with red-green-blue space channels and one time channel. Furthermore, they use batch mode convolution, meaning they take input vectors, images, videos, or other data types, in batches and apply parallel convolution operations simultaneously. Even with kernel flipping, these operations are not guaranteed to be commutative unless each operation has the same number of output channels as input channels. This is usually not the case, since the outputs of multiple channels are usually summed together as a whole or partially, producing a different number of output channels than input channels.

For all these reasons, many machine learning libraries do not flip the kernel when implementing convolution, in essence implementing cross-correlation and calling it convolution. We do the same here.
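
NumPy makes the distinction easy to see: `np.convolve` flips the kernel while `np.correlate` slides it as-is, so the two agree only after reversing an asymmetric kernel.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])       # an asymmetric kernel

conv = np.convolve(f, k, mode="valid")        # flips the kernel
corr = np.correlate(f, k, mode="valid")       # slides it without flipping
corr_via_conv = np.convolve(f, k[::-1], mode="valid")

print(conv, corr)                             # differ: k is not symmetric
print(np.allclose(corr, corr_via_conv))       # flipping reconciles them
```

With a symmetric kernel such as [1, 2, 1], all three results would coincide, which is why the conflation is usually harmless.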

The convolution operation between two real valued functions k (the filter) and f is defined as:

$(k * f)(t) = \int_{-\infty}^{\infty} f(s)\, k(t - s)\, ds = \int_{-\infty}^{\infty} f(t - s)\, k(s)\, ds$

and the discrete analog for discrete functions is:

$(k * f)(n) = \sum_{s=-\infty}^{\infty} f(s)\, k(n - s) = \sum_{s=-\infty}^{\infty} f(n - s)\, k(s)$

The cross-correlation operation between two real valued functions k (the filter) and f is defined as:

$(k \star f)(t) = \int_{-\infty}^{\infty} f(s)\, k(s + t)\, ds = \int_{-\infty}^{\infty} f(s - t)\, k(s)\, ds$

and the discrete analog for discrete functions is:

$(k \star f)(n) = \sum_{s=-\infty}^{\infty} f(s)\, k(s + n) = \sum_{s=-\infty}^{\infty} f(s - n)\, k(s)$

Note that the formulas defining convolution and cross-correlation look exactly the same, except that for convolution we use $-s$ instead of $s$. This corresponds to flipping the involved function (before shifting). Note also that the indices involved in the convolution integral and sum add up to $t$ or $n$, which is not the case for cross-correlation. This makes convolution commutative, in the sense that $(f * k)(n) = (k * f)(n)$, while cross-correlation is not necessarily commutative. We comment further on commutativity later.
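
A quick numerical check of this on arbitrary random sequences:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(6)
k = rng.standard_normal(3)

# Convolution is commutative...
print(np.allclose(np.convolve(f, k), np.convolve(k, f)))

# ...while cross-correlation is not: swapping the arguments reverses
# the output rather than reproducing it.
c1 = np.correlate(f, k, mode="full")
c2 = np.correlate(k, f, mode="full")
print(np.allclose(c1, c2))        # False for generic f and k
print(np.allclose(c1, c2[::-1]))  # the reversal identity holds
```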

Every time we encounter a new mathematical object, it is good to pause and ask ourselves a few questions before diving in. This way we can build a solid mathematical foundation while managing to avoid the vast and dark ocean of technical and complicated mathematics.

What kind of mathematical object do we have on our hands?

For convolution and cross-correlation, we are looking at infinite sums for the discrete case, and integrals over infinite domains for the continuum case. This guides our next question.

Since we are summing over infinite domains, what kind of functions can we allow without our computations blowing up to infinity?

In other words, for what kind of functions do these infinite sums and integrals exist and are well-defined? Here, start with the easy answers, such as when f and k are compactly supported (are zeros except in a finite portion of the domain), or when f and k decay rapidly enough to allow the infinite sum or integral to converge. Most of the time, these simple cases are enough for our applications, such as image and audio data, and the filters that we accompany them with. Only when we find that our simple answer does not apply to our particular use case do we seek more general answers. These come in the form of theorems and proofs. Do not seek the most general answers first, as these are usually built on top of a large mathematical theory that took centuries, and countless questions and searches for answers, to take form and materialize. Taking the most general road first is overwhelming, counterintuitive, counter-historical, and time- and resource-draining. It is not how mathematics and analysis naturally evolve. Moreover, if we happen to encounter someone talking in the most general and technical language without providing any context or motivation for why this level of generality and complexity is needed, we just tune them out and move on peacefully with our lives; otherwise they might confuse us beyond repair.

What is this mathematical object used for?

The convolution operation has far-reaching applications that range from purely theoretical mathematics to applied sciences to the engineering and systems design fields. In mathematics, it appears in many fields, such as differential equations, measure theory, probability, statistics, analysis, and numerical linear algebra. In the applied sciences and engineering, it is used in acoustics, spectroscopy, image processing and computer vision, and in the design and implementation of finite impulse response filters in signal processing.

How is this mathematical object useful in my particular field of study and how does it apply to my particular interest or use case?

For our AI purposes, we will use the convolution operation to construct convolutional neural networks for both one-dimensional text and audio data and two-dimensional image data. The same ideas generalize to any type of high-dimensional data where most of the information is contained locally. We use convolutional neural networks in two contexts: understanding image, text, and audio data; and generating image, text, and audio data.

Additionally, in the context of data and distributions of data, we use the following result that has to do with the probability distribution of the sum of two independent random variables. If $\mu$ and $\nu$ are probability measures on the topological group $(\mathbb{R}, +)$, and if $X$ and $Y$ are two independent random variables whose respective distributions are $\mu$ and $\nu$, then the convolution $\mu * \nu$ is the probability distribution of the sum random variable $X + Y$. We elaborate on this in Chapter 11 on probability.
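
A discrete sketch of this result: the PMF of the sum of two independent fair dice is the convolution of their individual PMFs.

```python
import numpy as np

# PMF of a fair six-sided die on the values 1..6.
die = np.full(6, 1 / 6)

# PMF of the sum of two independent dice: the convolution of the two
# PMFs, supported on the values 2..12.
pmf_sum = np.convolve(die, die)

# Sanity check against brute-force enumeration of all 36 outcomes.
brute = np.zeros(11)
for i in range(1, 7):
    for j in range(1, 7):
        brute[i + j - 2] += 1 / 36

print(np.allclose(pmf_sum, brute))
print(pmf_sum[5])                    # P(sum = 7) = 6/36
```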

How did this mathematical object come to be?

It is worth investing a tad bit of extra time to learn some of the history and the chronological order of when, how, and why the object that we care about first appeared, along with the main results associated with it. In other words, rather than studying our mathematical object through a series of dry lemmas, propositions, and theorems, which are usually deprived of all context, we learn through its own story, and through the ups and downs that mathematicians encountered while attempting to develop it. One of the most valuable insights we learn here is that mathematics develops organically with our quest to answer various questions, establish connections, and gain a deeper understanding of something that we need to use. Modern mathematical analysis developed while attempting to answer very simple questions related to Fourier sine and cosine series (decomposing a function into its component frequencies), which turned out to be not so simple for many types of functions. For example: when can we interchange the integral and the infinite sum, what is an integral, and what is dx anyway?

What comes as a surprise to many, especially to people who feel intimidated or scared of mathematics, is that during the quest to gain understanding, some of the biggest names in mathematics, including the fathers and mothers of certain fields, made multiple mistakes along the way and corrected them at later times, or were corrected by others until the theory finally took shape.

One of the earliest uses of the convolution integral appeared in 1754, in d’Alembert’s derivation of Taylor’s theorem. Later, between 1797 and 1800, it was used by Sylvestre François Lacroix in his book, Treatise on Differences and Series, which is part of his encyclopedic series, An Elementary Treatise on Differential Calculus and Integral Calculus. Soon thereafter, convolution operations appear in the works of very famous names in mathematics, such as Laplace, Fourier, Poisson, and Volterra. What is common here is that all of these studies have to do with integrals, derivatives, and series of functions. In other words, calculus, and again, decomposing functions into their component frequencies (Fourier series and transform).

What are the most important operations, manipulations, and/or theorems related to this mathematical object that we must be aware of before diving in?

Things do not spring into fame or viral success without being immensely beneficial to many, many people. The convolution operation is so simple yet so useful, and it generalizes neatly to more involved mathematical entities, such as measures and distributions. It is commutative, associative, and distributive over addition and scalar multiplication; the integral of a convolution is the product of the integrals of its component functions, and its derivative reduces to differentiating only one of the component functions.

Translation Invariance and Translation Equivariance

These properties of convolutional neural networks enable us to detect similar features in different parts of an image. That is, a pattern that occurs in one location of an image is easily recognized in other locations of the image. The main reason is that at a convolutional layer of a neural network, we convolute with the same filter (also called a kernel, or template) throughout the image, picking up on the same patterns (such as edges, or horizontal, vertical, and diagonal orientations). This filter has a fixed set of weights. Recall that this is not the case for fully connected neural networks, where we would have to use different weights for different pixels of an image. Since we use convolution instead of matrix multiplication to perform image filtering, we have the benefit of translation invariance, since usually all we care for is the presence of a pattern, irrespective of its location. Mathematically, translation invariance looks like:

$\mathrm{trans}_a(k) * f = k * \mathrm{trans}_a(f) = \mathrm{trans}_a(k * f)$

where $\mathrm{trans}_a$ is the translation of a function by $a$. For our AI purposes, this implies that given a filter designed to pick up on a certain feature in an image, convolving it with a translated image (in the horizontal or vertical directions) is the same as filtering the image and then translating it. This property is sometimes called translation equivariance, and translation invariance is instead attributed to the pooling layer that is often built into the architecture of convolutional neural networks. We discuss this later in this chapter. In any case, the fact that we use one filter (one set of weights) throughout the whole image at each layer means that we detect one pattern at various locations of the image, when present. The same applies to audio data or any other type of grid-like data.
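
Equivariance is easy to verify numerically if we use circular convolution (computed here via the FFT) so that a shift wraps around and no boundary effects intrude; the signal and filter below are arbitrary.

```python
import numpy as np

def circ_conv(f, k):
    # Circular convolution via the FFT; f and k have the same length.
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(k)))

rng = np.random.default_rng(2)
f = rng.standard_normal(16)                   # a 1-D "image"
k = np.zeros(16)
k[:3] = [1.0, -2.0, 1.0]                      # a small filter, zero-padded

shift = 5
filtered_then_shifted = np.roll(circ_conv(f, k), shift)
shifted_then_filtered = circ_conv(np.roll(f, shift), k)
print(np.allclose(filtered_then_shifted, shifted_then_filtered))
```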

Convolution in Usual Space Is a Product in Frequency Space

The Fourier transform of the convolution of two functions is the product of the Fourier transform of each function, up to a scaling. In short, the Fourier transform resolves a function into its frequency components (we elaborate on the Fourier transform in Chapter 13). Therefore, the convolution operation does not create new frequencies, and the frequencies present in the convolution function are simply the product of the frequencies of the component functions.
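
This is the convolution theorem, and we can check it numerically for the discrete, circular case: multiplying the FFTs of two arbitrary sequences and transforming back matches the directly computed circular convolution.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
f = rng.standard_normal(n)
k = rng.standard_normal(n)

# Circular convolution computed directly from the definition...
direct = np.array([sum(f[m] * k[(i - m) % n] for m in range(n))
                   for i in range(n)])

# ...and by multiplying in frequency space (the convolution theorem).
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(k)))
print(np.allclose(direct, via_fft))
```

This identity is also why FFT-based convolution is the fast algorithm of choice for long signals.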

In mathematics, we are forever updating our arsenal of useful tools for tackling different problems, so depending on our domain of expertise, it is at this point that we branch out and study related but more involved results, such as circular convolutions for periodic functions, preferred algorithms for computing convolutions, and others. The best route for us to dive into convolution in a way that is useful for our AI purposes is through signal and system design, which is the topic of the next section.

Convolution from a Systems Design Perspective

We are surrounded by systems, each interacting with its environment and designed to accomplish a certain task. Examples include HVAC systems in our buildings, adaptive cruise control systems in our cars, city transportation systems, irrigation systems, various communication systems, security systems, navigation systems, data centers, etc. Some systems interact with each other, others do not. Some are very large, others are as small and simple as a single device, such as a thermostat, receiving signals from its environment through sensors, processing them, and outputting other signals, for example, to actuators.

The convolution operation appears when designing and analyzing such simple systems, which process an input signal and produce an output signal, if we impose two special constraints: linearity and time invariance. Linearity and time invariance become linearity and translation or shift invariance if we are dealing with space-dependent signals, such as images, instead of time-dependent signals, such as electrical signals or audio signals. Note that a video is both space- (two or three spatial dimensions) and time-dependent. Linearity in this context has to do with the output of a scaled signal (amplified or reduced), and the output of two superimposed signals. Time and translation invariance have to do with the output of a delayed (time-dependent) signal, or the output of a translated or shifted (space-dependent) signal. We will detail these in the next section.

Together, linearity and time or translation invariance are very powerful. They allow us to find the system’s output for any signal, provided we know the output for a simple impulse signal, called the system’s impulse response. The system’s output for any input signal is obtained by merely convolving the signal with the system’s impulse response. Therefore, imposing the conditions of linearity and time or translation invariance dramatically simplifies the analysis of signal processing systems.
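
Here is that recipe as a sketch: we treat a small moving-average filter as a black-box LTI system, probe it with a unit impulse to recover its impulse response, and then predict its output for an arbitrary signal by convolution alone.

```python
import numpy as np

# A "black box" LTI system: here, secretly a weighted moving average.
# In practice we could only feed it inputs and observe outputs.
def system(x):
    return np.convolve(x, [0.25, 0.5, 0.25], mode="full")

# Probe it with a unit impulse to measure its impulse response h.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
h = system(impulse)[:3]              # the response reveals the kernel

# The response to ANY signal is then just convolution with h.
x = np.array([1.0, -1.0, 2.0, 0.5])
print(np.allclose(system(x), np.convolve(x, h, mode="full")))
```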

The nagging question is then: how realistic are these wonder-making conditions? In other words, how prevalent are linear and time/translation invariant systems, or even only approximately linear and approximately time/translation invariant systems? Aren’t most realistic systems nonlinear and complex? Thankfully, we have control over system designs, so we can just decide to design systems with these properties. One example is any electrical circuit consisting of capacitors, resistors, inductors, and linear amplifiers. This is in fact mathematically equivalent to an ideal mechanical spring, mass, and damper system. Other examples that are relevant for us include processing and filtering various types of signals and images. We discuss these in the next few sections.

Convolution and Impulse Response for Linear and Translation Invariant Systems

Let’s formalize the concepts of a linear system and a time/translation invariant system, then understand how the convolution operation naturally arises when attempting to quantify the response of a system possessing these properties to any signal. From a math perspective, a system is a function H that takes an input signal x and produces an output signal y . The signals x and y can depend on time, space (single or multiple dimensions), or both. If we enforce linearity on such a function, then we are claiming two things:

  1. Output of a scaled input signal is nothing but a scaling of the original output: H(ax) = aH(x) = ay.

  2. Output of two superimposed signals is nothing but the superposition of the two original outputs: H(x1 + x2) = H(x1) + H(x2) = y1 + y2.

If we enforce time/translation invariance, then we are claiming that the output of a delayed/translated/shifted signal is nothing but a delayed/translated/shifted original output: H ( x ( t - t 0 ) ) = y ( t - t 0 ) .

We can leverage these conditions when we think of any arbitrary signal, whether discrete or continuous, in terms of a superposition of a bunch of impulse signals of various amplitudes. This way, if we are able to measure the system’s output to a single impulse signal, called the system’s impulse response, then that is enough information to measure the system’s response to any other signal. This becomes the basis of a very rich theory. We only walk through a discrete case in this chapter, since the signals we care about in AI (for example, for natural language processing, human-machine interaction, and computer vision), whether one-dimensional audio signals or two- or three-dimensional images, are discrete. The continuous case is analogous, except that we consider infinitesimal steps instead of discrete steps, we use integrals instead of sums, and we enforce continuity conditions (or whatever conditions we need to make the involved integrals well-defined). Actually, the continuous case brings a little extra complication: having to properly define an impulse in a mathematically sound way, for it is not a function in the usual sense. Thankfully, there are multiple mathematical ways to make it well-defined, such as the theory of distributions, or defining it as an operator acting on usual functions, or as a measure with the help of Lebesgue integrals. However, I have been avoiding measures, Lebesgue integrals, and any deep mathematical theory until we truly need them with their added functionality, so for now, the discrete case is entirely sufficient.

We define a unit impulse δ(k) to be zero for each nonzero k, and one when k = 0, and define its response as H(δ(k)) = h(k). Then δ(n - k), viewed as a function of k, is zero for each k ≠ n, and 1 for k = n. This represents a unit impulse located at k = n. Therefore, x(k)δ(n - k) is an impulse of amplitude x(k) located at k = n. Now we can write the input signal x(n) as:

x(n) = ∑_{k=-∞}^{∞} x(k) δ(n - k)

The above sum might seem like such a convoluted way to write a signal, and it is, but it is very helpful, since it says that any discrete signal can be expressed as an infinite sum of unit impulses that are scaled correctly at the right locations. Now, using the linearity and translation invariance assumptions on H, it is straightforward to see that the system’s response to the signal x(n) is:

H(x(n)) = H(∑_{k=-∞}^{∞} x(k) δ(n - k)) = ∑_{k=-∞}^{∞} x(k) H(δ(n - k)) = ∑_{k=-∞}^{∞} x(k) h(n - k) = (x * h)(n) = y(n)

Therefore, a linear and translation invariant system is completely described by its impulse response h(n). But there is another way to look at this, which is independent of linear and translation invariant systems, and which is very useful for our purposes in the next few sections. The statement:

y(n) = (x * h)(n)

says that the signal x(n) can be transformed into the signal y(n) after we convolve it with the filter h(n). Hence, designing the filter h(n) carefully can produce a y(n) with some desired characteristics, or can extract certain features from the signal x(n), for example, all the edges. Moreover, using different filters h(n) extracts different features from the same signal x(n). We elaborate more on these concepts in the next few sections; meanwhile, keep the following in mind: as information, such as signals or images, flows through a convolutional neural network, different features get extracted (mapped) at each convolutional layer.
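As a quick numerical illustration of both points (the filters and signal below are made up for demonstration): convolving a unit impulse with a system recovers the system’s impulse response, and different filters pull different features out of the same step signal:

```python
import numpy as np

# The response to a unit impulse is the impulse response h itself
delta = np.zeros(7)
delta[0] = 1.0
h = np.array([1.0, -1.0])                         # a difference filter
assert np.allclose(np.convolve(delta, h)[:2], h)  # impulse in, h out

# Different filters extract different features from the same signal
x = np.array([0, 0, 0, 5, 5, 5, 0, 0], dtype=float)       # a step "edge"
edges = np.convolve(x, np.array([1.0, -1.0]), mode='same')  # spikes at edges
smooth = np.convolve(x, np.array([1/3, 1/3, 1/3]), mode='same')  # blurs the step
```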

Before leaving linear and time/translation invariant systems, we must mention that such systems have a very simple response for sinusoidal inputs. If the input to the system is a sine wave with a given frequency, then the output is also a sine wave with the same frequency, but possibly with a different amplitude and phase. Moreover, knowing the impulse response of the system allows us to compute its frequency response, which is the system’s response for sine waves at all frequencies, and vice versa. That is, determining the system’s impulse response allows us to compute its frequency response, and determining the frequency response allows us to compute its impulse response, which in turn completely determines the system’s response to any arbitrary signal. This connection is very useful from both theoretical and applications perspectives, and is closely connected to Fourier transforms and frequency domain representations of signals. In short, the frequency response of a linear and translation invariant system is simply the Fourier transform of its impulse response. We do not go over the computational details here, since these concepts are not essential for the rest of this book; nevertheless, it is important to be aware of these connections and understand how things from seemingly different fields come together and relate to each other.
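The claim about sinusoidal inputs is easy to verify numerically. In the sketch below (the filter is an arbitrary three-point average), passing the complex exponential e^{iωn} through the filter returns the same exponential scaled by the frequency response H(ω) = ∑_k h(k)e^{-iωk}:

```python
import numpy as np

h = np.array([0.25, 0.5, 0.25])    # impulse response of a small averaging filter
omega = 2 * np.pi * 0.1            # an arbitrary frequency

# Frequency response at omega: H(omega) = sum_k h(k) e^{-i omega k}
H_w = np.sum(h * np.exp(-1j * omega * np.arange(len(h))))

# Pass the complex exponential e^{i omega n} through the filter
x = np.exp(1j * omega * np.arange(50))
y = np.convolve(x, h)

# Where the convolution fully overlaps, y(n) = H(omega) e^{i omega n}:
# same frequency, scaled amplitude and shifted phase
n = np.arange(2, 50)
assert np.allclose(y[n], H_w * np.exp(1j * omega * n))
```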

Convolution and One-Dimensional Discrete Signals

Let’s dissect the convolution operation and understand how it creates a new signal from an input signal by sliding a filter (kernel) against it. We will not flip the kernel here, as we established the fact that flipping the kernel is irrelevant from a neural network perspective. We start with a one-dimensional discrete signal x(n), then we convolve it with a filter (kernel) k(n), and produce a new signal z(n). We show how we obtain z(n) one entry at a time by sliding k(n) along x(n). For simplicity, let x ( n ) = ( x 0 , x 1 , x 2 , x 3 , x 4 ) and k ( n ) = ( k 0 , k 1 , k 2 ) . In this example the input signal x(n) only has five entries, and the kernel has three entries. In practice, such as in signal processing, image filtering, or AI’s neural networks, the input signal x(n) is orders of magnitude larger than the filter k(n). We will see this very soon, but the example here is only for illustration. Recall the formula for discrete cross-correlation, which is the convolution without flipping the kernel:

(k * x)(n) = ∑_{s=-∞}^{∞} x(s) k(s + n) = ⋯ + x(-1)k(-1 + n) + x(0)k(n) + x(1)k(1 + n) + x(2)k(2 + n) + ⋯

Since neither x(n) nor k(n) has infinitely many entries, and the sum is infinite, we pretend that the entries are zero whenever the indices are not defined. The new filtered signal resulting from the convolution will only have nontrivial entries with indices: -4, -3, -2, -1, 0, 1, 2. Let’s write each of these entries:

(k * x)(-4) = x4 k0
(k * x)(-3) = x3 k0 + x4 k1
(k * x)(-2) = x2 k0 + x3 k1 + x4 k2
(k * x)(-1) = x1 k0 + x2 k1 + x3 k2
(k * x)(0) = x0 k0 + x1 k1 + x2 k2
(k * x)(1) = x0 k1 + x1 k2
(k * x)(2) = x0 k2

That operation is easier to understand with a mental picture: fix the signal x ( n ) = ( x 0 , x 1 , x 2 , x 3 , x 4 ) and slide the filter k ( n ) = ( k 0 , k 1 , k 2 ) against it from right to left. This can also be summarized concisely using linear algebra notation, where we multiply the input vector x(n) by a special kind of matrix, called the Toeplitz matrix, containing the filter weights. We will elaborate on this later in this chapter.
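A minimal sketch of this computation, using the same indexing convention as the formula above (the numerical values of x and k are arbitrary):

```python
def cross_correlate(x, k):
    # (k * x)(n) = sum_s x(s) k(s + n), with entries treated as zero
    # whenever an index falls outside the defined range
    nx, nk = len(x), len(k)
    return {n: sum(x[s] * k[s + n] for s in range(nx) if 0 <= s + n < nk)
            for n in range(-(nx - 1), nk)}

x = [1, 2, 3, 4, 5]     # x0, ..., x4
k = [10, 20, 30]        # k0, k1, k2
z = cross_correlate(x, k)

assert z[-4] == 5 * 10                     # x4 k0
assert z[0] == 1 * 10 + 2 * 20 + 3 * 30    # x0 k0 + x1 k1 + x2 k2
assert z[2] == 1 * 30                      # x0 k2
```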

When we slide a filter across the signal, the output signal will peak at the indices where x(n) and k(n) match, so we can design the filter in a way that picks up on certain patterns in x(n). This way, convolution (unflipped) provides a measure of similarity between the signal and the kernel.

We can also view cross-correlation, or unflipped convolution, in a different way: each entry of the output signal is a weighted average of the entries of the input signal. This way, we emphasize both the linearity of this transformation and the fact that we can choose the weights in the kernel in a way that emphasizes certain features over others. We see this more clearly with image processing, discussed next, but for that we have to write a formula for convolution (unflipped) in two dimensions. In linear algebra notation, discrete two-dimensional convolution amounts to multiplication of the two-dimensional signal by another special kind of matrix, called the doubly block circulant matrix, also appearing later in this chapter. Keep in mind that multiplication by a matrix is a linear transformation.

Convolution and Two-Dimensional Discrete Signals

The convolution (unflipped) operation in two dimensions looks like:

(k * x)(m, n) = ∑_{q=-∞}^{∞} ∑_{s=-∞}^{∞} x(m + q, n + s) k(q, s)

For example, the (2,1) entry of the convolution (unflipped) between the following 4 × 4 matrix A and the 3 × 3 kernel K:

A * K =
[ a00 a01 a02 a03 ]   [ k00 k01 k02 ]
[ a10 a11 a12 a13 ] * [ k10 k11 k12 ]
[ a20 a21 a22 a23 ]   [ k20 k21 k22 ]
[ a30 a31 a32 a33 ]

is z21 = a10k00 + a11k01 + a12k02 + a20k10 + a21k11 + a22k12 + a30k20 + a31k21 + a32k22. To see this, imagine placing the kernel K exactly on top of the matrix A with its center k11 on top of a21, meaning the entry of A with the required index, then multiply the entries that are on top of each other and add all the results together. Note that here we only computed one entry of the output signal, meaning if we were working with images, it would be the value of only one pixel of the filtered image. We need all the others! For this, we have to know which indices are valid convolutions. By valid we mean full, as in they take all of the kernel entries into account during the computation. The word valid is a bit misleading, since all entries are valid if we allow ourselves to pad the boundaries of matrix A with zeros. To find the indices that take the full kernel into account, recall our mental image of placing K exactly on top of A, with K’s center at the index that we want to compute. With this placement, the rest of K should not exceed the boundaries of A, so for our example, the good indices will be (1,1), (1,2), (2,1), and (2,2), producing the output:

Z =
[ z11 z12 ]
[ z21 z22 ]

This means that if we were working with images, the filtered image Z would be smaller in size than the original image A. If we want to produce an image with the same size as the original image, then we must pad the original image with zeros before applying the filter. For our example, we would need to pad with one layer of zeros around the full boundary of A, but if K was larger, then we would need more layers of zeros. The following is A padded with one layer of zeros:

A_padded =
[ 0 0   0   0   0   0 ]
[ 0 a00 a01 a02 a03 0 ]
[ 0 a10 a11 a12 a13 0 ]
[ 0 a20 a21 a22 a23 0 ]
[ 0 a30 a31 a32 a33 0 ]
[ 0 0   0   0   0   0 ]
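Here is a sketch of the valid two-dimensional cross-correlation and the effect of zero padding (the input values are arbitrary):

```python
import numpy as np

def correlate2d_valid(A, K):
    # "Valid" two-dimensional unflipped convolution (cross-correlation):
    # slide K over A, keeping only positions where K fits entirely inside A
    m, n = A.shape
    p, q = K.shape
    Z = np.empty((m - p + 1, n - q + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            Z[i, j] = np.sum(A[i:i+p, j:j+q] * K)
    return Z

A = np.arange(16, dtype=float).reshape(4, 4)
K = np.ones((3, 3))

Z = correlate2d_valid(A, K)
assert Z.shape == (2, 2)                 # the output is smaller than A
assert Z[0, 0] == A[0:3, 0:3].sum()      # top-left placement of the kernel

# Zero padding one layer around A restores the original 4 x 4 output size
A_padded = np.pad(A, 1)                  # pads with zeros by default
assert correlate2d_valid(A_padded, K).shape == (4, 4)
```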

There are other ways besides zero padding to retain the size of the output image, even though zero padding is the simplest and most popular approach. These include:

Reflection

Instead of adding a layer or multiple layers of zeros, add a layer or multiple layers of the same values as the boundary pixels of the image, and the ones below them, and so on if we need more. That is, instead of turning off the pixels outside an image boundary, extend the image using the same pixels that are already on or near the boundary.

Wraparound

This is used for periodic signals. Mathematically, this is cyclic convolution, circulant matrix, discrete Fourier transform producing the eigenvalues of the circulant matrix, and the Fourier matrix containing its eigenvectors as columns. We will not dive into that here, but recall that periodicity usually simplifies things, and makes us think of periodic waves, which in turn makes us think of Fourier stuff.

Multiple channels

Apply more than one channel of independent filters (weight matrices), each sampling the original input, then combine their outputs.
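The padding strategies above can be sketched with numpy’s pad modes (the tiny 2 × 2 input is just for illustration):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]], dtype=float)

zero = np.pad(A, 1)                  # zero padding (the default constant mode)
extend = np.pad(A, 1, mode='edge')   # extend using the boundary pixels
wrap = np.pad(A, 1, mode='wrap')     # wraparound, for periodic signals

assert zero[0, 0] == 0       # corner filled with a zero
assert extend[0, 0] == 1     # boundary pixel replicated outward
assert wrap[0, 0] == 4       # wraps around to the opposite corner
```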

Filtering Images

Figure 5-1 shows an example of how convolving the same image with various kernels extracts various features of the image.

For example, the third kernel in the table has 8 at its center and the rest of its entries are –1. This means that this kernel makes the current pixel 8 times as intense, then subtracts from that the values of all the pixels surrounding it. If we are in a uniform region of the image, meaning all the pixels are equal or very close in value, then this process will give zero, returning a black or turned-off pixel. If, on the other hand, this pixel lies on an edge boundary, for example, the boundary of the eye or the boundary of the face of the deer, then the output of the convolution will have a nonzero value, so it will be a bright pixel. When we apply this process to the whole image, the result is a new image with many edges traced with bright pixels and the rest of the image dark.
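A minimal numerical check of this behavior (the patch values are made up):

```python
import numpy as np

# The edge-detecting kernel described above: 8 at the center, -1 elsewhere
K = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)

uniform = np.full((3, 3), 7.0)        # a uniform region of the image
assert np.sum(uniform * K) == 0       # no edge: the output pixel is 0 (dark)

edge = np.array([[0, 0, 10],          # a patch straddling a vertical edge
                 [0, 0, 10],
                 [0, 0, 10]], dtype=float)
assert np.sum(edge * K) != 0          # nonzero: a bright output pixel
```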

Figure 5-1. Applying various filters to an image (image source)

As we see in the table, the choice of kernel that is able to detect edges, blur, etc., is not unique. The same table includes two-dimensional discrete Gaussian filters for blurring. When we discretize a one-dimensional Gaussian function, we do not lose its symmetry, but when we discretize a two-dimensional Gaussian function, we lose its radial symmetry, since we have to approximate its natural circular or elliptical shape by a square matrix. Note that a Gaussian peaks at the center and decays as it spreads away from the center. Moreover, the area under its curve (its surface, in two dimensions) is one. This has the overall effect of averaging and smoothing (removing noise) when we convolve it with another signal. The price we pay is the removal of sharp edges, which is exactly what blurring is (think of a sharp edge being replaced by a smoothly decaying average of itself and all the surrounding pixels within a distance of a few standard deviations from the center). The smaller the standard deviation (or the variance, which is the square of the standard deviation), the more detail we retain from the image.
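A discrete two-dimensional Gaussian kernel can be sketched as follows (the grid size and standard deviation below are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    # Discrete 2-D Gaussian on a size x size grid, normalized so its entries
    # sum to one (so convolving with it averages rather than rescales)
    r = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(r, r)
    G = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return G / G.sum()

G = gaussian_kernel(5, sigma=1.0)
assert np.isclose(G.sum(), 1.0)     # area under the kernel is one
assert G[2, 2] == G.max()           # peaks at the center, decays outward
assert np.allclose(G, G.T)          # symmetric, though not radially so
```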

Another great example is shown in Figure 5-2. Here, each image on the right is the result of the convolution between the image on the left and the filter (kernel) portrayed in the middle of each row. These filters are called Gabor filters. They are designed to pick up on certain patterns/textures/features in an image, and they work in a similar way to the filters found in the human visual system.
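Such filters can be sketched with the standard Gabor formula, a plane wave windowed by a Gaussian envelope (the wavelength, width, and aspect-ratio parameters below are arbitrary, not the ones used in the figure). The orientation θ controls which direction of texture the filter picks up on:

```python
import numpy as np

def gabor_kernel(size, theta, lam=4.0, sigma=2.0, gamma=1.0):
    # Standard Gabor filter: a cosine wave of wavelength lam in direction
    # theta, windowed by a Gaussian envelope
    r = np.arange(size) - (size - 1) / 2
    x, y = np.meshgrid(r, r)
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    y_theta = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_theta**2 + gamma**2 * y_theta**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * x_theta / lam)

# theta = 0 gives vertical stripes; theta = pi/2 gives the transposed pattern
G0 = gabor_kernel(9, theta=0.0)
assert np.allclose(G0, G0[::-1, :])                # constant along y
assert np.allclose(gabor_kernel(9, np.pi / 2), G0.T)
```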

Our eyes detect the parts of an image where things change, i.e., contrasts. They pick up on edges (horizontal, vertical, diagonal) and gradients (measuring the steepness of a change). We design filters that do the same things mathematically, sliding them via convolution across a signal. These produce smooth and uneventful results (zeros or numbers close to each other in value) when nothing is changing in the signal, and spike when an edge or a gradient that lines up with the kernel is detected.

Figure 5-2. Gabor filters applied to the same image; we design the filters to pick up on different textures and orientations in the image. Visit the book’s GitHub page to reproduce these images.

Feature Maps

A convolutional neural network learns the kernels from the data by optimizing a loss function in the same way a fully connected network does. The unknown weights that enter the formula of the training function are the entries of each kernel at each convolutional layer, the biases, and the weights related to any fully connected layer involved in the network’s architecture. The output of a convolutional layer (which includes the nonlinear activation function) is called a feature map, and the learned kernels are feature detectors. A common observation in neural networks is that earlier layers in the network (close to the input layer) learn low-level features such as edges, and later layers learn higher-level features such as shapes. This is naturally expected since at each new layer we compose with a nonlinear activation function, so complexity increases over multiple layers and so does the network’s ability to express more elaborate features.

How do we plot feature maps?

Feature maps help us open the black box and directly observe what a trained network detects at each of its convolutional layers. If the network is still in the training process, feature maps allow us to pinpoint the sources of error, then tweak the model accordingly. Suppose we input an image to a trained convolutional neural network. At the first layer, using the convolution operation, a kernel slides through the whole image and produces a new filtered image. This filtered image then gets passed through a nonlinear activation function, producing yet another image. Finally, this gets passed through a pooling layer, explained soon, and produces the final output of the convolutional layer. This output is a different image than the one we started with, and possibly of different dimensions, but if the network was trained well, the output image would highlight some important features of the original image, such as edges, texture, etc. This output image is usually a matrix of numbers or a tensor of numbers (three-dimensional for color images, or four-dimensional if we are working in batches of images or with video data, where there is one extra dimension for the time series). It is easy to visualize these as feature maps using the matplotlib library in Python, where each entry in the matrix or tensor is mapped to the intensity of a pixel in the same location as the matrix entry. Figure 5-3 shows various feature maps at various convolutional layers of a convolutional neural network.
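The pipeline just described (filter, then nonlinear activation, then pooling) can be sketched by hand; the image, filter, and layer sizes below are made up, and in a trained network the filter entries would have been learned rather than chosen:

```python
import numpy as np

image = np.random.default_rng(1).random((8, 8))   # a stand-in grayscale image
K = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])                     # picks up vertical contrast

# Valid cross-correlation of the 8 x 8 image with the 3 x 3 filter
filtered = np.array([[np.sum(image[i:i+3, j:j+3] * K)
                      for j in range(6)] for i in range(6)])
activated = np.maximum(filtered, 0)               # ReLU nonlinearity
pooled = activated.reshape(3, 2, 3, 2).max(axis=(1, 3))   # 2 x 2 max pooling

assert pooled.shape == (3, 3)
# The feature map can then be displayed, e.g.:
# import matplotlib.pyplot as plt; plt.imshow(pooled); plt.show()
```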

Figure 5-3. Feature maps at various convolutional layers of a convolutional neural network (image source)

Linear Algebra Notation

We now know that the most fundamental operation inside a convolutional neural network is the convolution. Given a filter k (could be one-dimensional, two-dimensional, three-dimensional, or higher), a convolution operation takes an input signal and applies k to it by sliding it across the signal. This operation is linear, meaning each output is a linear combination (by the weights in k) of the input components, thus it can be efficiently represented as a matrix multiplication. We need this efficiency, since we must still write the training function representing the convolutional neural network in a way that is easy to evaluate and differentiate. The mathematical structure of the problem is still the same as for many machine learning models, which we are very familiar with now:

Training function

For convolutional neural networks, this usually includes linear combinations of components of the input, composed with activation functions, then a pooling function (discussed soon), over multiple layers of various sizes and connections, and finally topped with a logistic function, a support vector machine function, or other functions depending on the ultimate purpose of the network (classification, image segmentation, data generation, etc.).

The difference from fully connected neural networks is that the linear combination happens now with the weights associated with the filter, which means they are not all different from each other (unless we are only doing locally connected layers instead of convolutional layers).

Moreover, the sizes of the filters are usually orders of magnitude smaller than the input signals, so when we express the convolution operation in matrix notation, most of the weights in the matrix will actually be zero. Recall that there is a weight per input feature at each layer of a network. For example, if the input is a color image, there is a different weight for each pixel in each channel, unless we decide to implement a convolutional layer, in which case there would only be a few unique weights, and many, many zeros. This simplifies storage requirements and computation time tremendously, while at the same time capturing the important local interactions.

That said, it is common in the architecture of a convolutional neural network to have fully connected layers (with a different weight assigned to each connection) near the output layer. One can rationalize this as a distillation of features: after important and locally dependent features, which increase in complexity with each layer, are captured, these are combined to make a prediction. That is, when information arrives to a fully connected layer, it would have been distilled to its most important components, which in turn act as unique features contributing to a final prediction.
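A back-of-the-envelope comparison of weight counts (the layer sizes here are hypothetical) shows why sharing filter weights matters:

```python
# Rough parameter counts for one layer acting on a small color image
h, w, c = 32, 32, 3                   # 32 x 32 image with 3 channels

# Fully connected layer with 100 units: one weight per pixel per unit
fully_connected = (h * w * c) * 100   # weights only, ignoring biases

# Convolutional layer: sixteen 3 x 3 filters spanning the 3 channels
convolutional = (3 * 3 * c) * 16

print(fully_connected, convolutional)   # 307200 versus 432
```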

Loss function

This is similar to all the loss functions we went over in the previous chapters, always providing a measure of the error between the network’s predictions and the ground truths.

Optimization

This again uses stochastic gradient descent, due to the enormity of the size of the problem and the number of variables that go into it. Here, as usual, we need to evaluate one derivative of the loss function, which includes one derivative of the training function. We compute the derivative with respect to all the unknown weights involved in the filters of all the network’s layers and channels, and the associated biases. Computationally, the backpropagation algorithm is still the workhorse of the differentiation process.

Linear algebra and computational linear algebra have all the tools we need to produce trainable convolutional neural networks. In general, the worst kind of matrix to be involved in our computations is a dense (mostly nonzero entries) matrix with no obvious structure or pattern to its entries (even worse if it is nondiagonalizable, etc.). But when we have a sparse matrix (mostly zeros), or a matrix with a certain pattern to its entries (such as diagonal, tridiagonal, circular, etc.), or both, then we are in a computationally friendly world, given that we learn how to exploit the special matrix structure to our advantage. Researchers who study large matrix computations and algorithms are the kings and queens of such necessary exploitation, and without their work we’d be left with theory that is very hard to implement in practice and at scale.

In one dimension, the convolution operation can be represented using a special kind of matrix, called the Toeplitz matrix; and in two dimensions, it can be represented using another special kind of matrix, called a doubly block circulant matrix. Let’s only focus on these two, but with the take-home lesson that matrix notation, in general, is the best way to go, and we would be fools not to discover and make the most use of the inherent structure within our matrices. In other words, attacking the most general cases first might just be a sad waste of time, which happens to be a rare commodity in this life. A good compromise between working with the most general matrix and the most specific matrix is to accompany results with complexity analysis, such as computing the order of a method (O(n³) or O(n log n), etc.), so that stakeholders are made aware of the trade-offs between implementing certain methods versus others.

We borrow a simple example from the free ebook Deep Learning (Chapter 9, page 334) by Ian Goodfellow et al. (MIT Press), on the efficiency of using either convolution or exploiting the many zeros in a matrix versus using the usual matrix multiplication to detect vertical edges within a certain image. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input.

The image on the right of Figure 5-4 was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection.

Figure 5-4. Detecting vertical edges in an image (image source)


Both images are 280 pixels tall. The input image is 320 pixels wide, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution.

To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over 16 billion floating-point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating-point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries.

Deep Learning, by Ian Goodfellow et al.
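The counts quoted above are quick to verify in a few lines (the 3 in the flop count is the two multiplications plus one addition per output pixel):

```python
out_h, out_w = 280, 319  # output image: 280 pixels tall, 319 wide
in_w = 320               # input image width

conv_flops = out_w * out_h * 3                     # flops using convolution
dense_entries = (in_w * out_h) * (out_w * out_h)   # entries in the full matrix
nonzero_entries = 2 * out_w * out_h                # stored entries if we keep only nonzeros

print(conv_flops)       # 267960
print(dense_entries)    # over eight billion
print(nonzero_entries)  # 178640
```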

The One-Dimensional Case: Multiplication by a Toeplitz Matrix

A banded Toeplitz matrix looks like:

$$\mathrm{Toeplitz} = \begin{pmatrix} k_0 & k_1 & k_2 & 0 & 0 & 0 & 0 & 0 \\ 0 & k_0 & k_1 & k_2 & 0 & 0 & 0 & 0 \\ 0 & 0 & k_0 & k_1 & k_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_0 & k_1 & k_2 & 0 & 0 \\ 0 & 0 & 0 & 0 & k_0 & k_1 & k_2 & 0 \\ 0 & 0 & 0 & 0 & 0 & k_0 & k_1 & k_2 \end{pmatrix}$$

Multiplying this Toeplitz matrix by a one-dimensional signal $x = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)$ yields the exact result of the convolution of a one-dimensional filter $k = (k_0, k_1, k_2)$ with the signal x, namely, $(\mathrm{Toeplitz})\,x^t = k * x$. Performing the multiplication, we see the sliding effect of the filter across the signal.
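A minimal NumPy check of this equivalence (the filter values are hypothetical; as in most deep learning texts, "convolution" here is the cross-correlation form, with no kernel flip):

```python
import numpy as np

k = np.array([2.0, -1.0, 3.0])  # the filter (k0, k1, k2)
x = np.arange(8.0)              # the signal (x0, ..., x7)

# Build the banded Toeplitz matrix: each row holds the filter,
# shifted one position to the right of the row above.
n_out = len(x) - len(k) + 1     # 6 output values
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i + len(k)] = k

# Multiplying by T is exactly sliding the filter across the signal:
assert np.allclose(T @ x, np.correlate(x, k, mode="valid"))
```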

The Two-Dimensional Case: Multiplication by a Doubly Block Circulant Matrix

The two-dimensional analog involves the two-dimensional convolution operation and filtering images. Here, instead of multiplying with a Toeplitz matrix, we end up multiplying with a doubly block circulant matrix, where each row is a circular shift of a given vector. It is a nice exercise in linear algebra to write this matrix down along with its equivalence to two-dimensional convolution. In deep learning, we end up learning the weights, which are the entries of these matrices. This linear algebra notation (in terms of Toeplitz or circulant matrices) helps us find compact formulas for the derivatives of the loss function with respect to these weights.

Pooling

One step common to almost all convolutional neural networks is pooling. This is typically implemented after the input gets filtered via convolution, then passed through the nonlinear activation function. There is more than one type of pooling, but the idea is the same: replace the current output at a certain location with a summary statistic of the nearby outputs. An example for images is to replace four pixels with one pixel containing the maximum value of the original four (max pooling), or with their average value, or a weighted average, or the square root of the sum of their squares, etc.

Figure 5-5 shows how max pooling works.

Figure 5-5. Max pooling (image source)

In effect, this reduces the dimension and summarizes whole neighborhoods of outputs, at the expense of sacrificing fine detail. So pooling is not very good for use cases where fine detail is essential for making predictions. Nevertheless, pooling has many advantages:

  • It provides approximate invariance to small spatial translations of the input. This is useful if we care more about whether some feature is present than its exact location.

  • It can greatly improve the statistical efficiency of the network.

  • It improves the computational efficiency and memory requirements of the network because it reduces the number of inputs to the next layer.

  • It helps with handling inputs of varying sizes, because we can control the size of the pooled neighborhoods and the size of the output after pooling.
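The 2 × 2 max pooling described above takes only a few lines of NumPy (a sketch for single-channel images with even dimensions):

```python
import numpy as np

def max_pool_2x2(img):
    # Group the image into non-overlapping 2x2 blocks and keep the
    # maximum of each block.
    h, w = img.shape
    blocks = img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

img = np.array([[1, 3, 2, 0],
                [4, 2, 1, 5],
                [0, 1, 9, 2],
                [3, 2, 4, 6]])
print(max_pool_2x2(img))  # [[4 5]
                          #  [3 9]]
```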

A Convolutional Neural Network for Image Classification

It is impossible to have an exhaustive list of all the different architectures and variations that are involved in neural networks without diverting from the main purpose of the book: understanding the mathematics that underlies the different models. It is possible, however, to go over the essential components and how they all come together to accomplish an AI task, such as image classification for computer vision. During the training process, the steps of Chapter 4 still apply:

  1. Initialize random weights (according to the initialization processes we described in Chapter 4).

  2. Forward pass a batch of images through the convolutional network and output a class for the image.

  3. Evaluate the loss function for this particular choice of weights.

  4. Backpropagate the error through the network.

  5. Adjust the weights that contributed to the error (stochastic gradient descent).

  6. Repeat for a certain number of iterations, or until you converge.
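In miniature, the steps above look like the following plain-NumPy sketch, with a toy linear model standing in for the convolutional network (full-batch gradient descent rather than true stochastic gradient descent, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # a toy "batch" of inputs
y = X @ np.array([1.0, -2.0, 0.5])        # ground-truth targets

w = rng.normal(size=3)                    # 1. initialize random weights
for _ in range(200):                      # 6. repeat until convergence
    pred = X @ w                          # 2. forward pass
    loss = ((pred - y) ** 2).mean()       # 3. evaluate the loss
    grad = 2 * X.T @ (pred - y) / len(y)  # 4. backpropagate the error
    w -= 0.1 * grad                       # 5. adjust the weights

print(loss)  # near zero: the weights have been learned
```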

Thankfully, we do not have to do any of this on our own. Python’s Keras library has many pre-trained models, which means that their weights have already been fixed, and all we have to do is evaluate the trained model on our particular data set.

What we can and should do is observe and learn the architecture of successful and winning networks. Figure 5-6 shows the simple architecture of LeNet1 by LeCun et al. (1989), and Figure 5-7 shows AlexNet’s (2012) architecture.

A good exercise is to count the number of weights that go into the training function of LeNet1 and AlexNet. Note that more units in each layer (feature maps) means more weights. When I tried to count the weights involved in LeNet1 based on the architecture in Figure 5-6, I ended up with 9,484 weights, but the original paper mentions 9,760 weights, so I do not know where the rest of the weights are. If you find them, please let me know. Either way, the point is that we need to solve an optimization problem in $\mathbb{R}^{9760}$. Now do the same computation for AlexNet in Figure 5-7: we have around 62.3 million weights, so the optimization problem ends up in $\mathbb{R}^{62.3\text{ million}}$. Another startling number: we need 1.1 billion computation units for one forward pass through the net.
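A sketch of the bookkeeping, using the standard parameter-count formulas for convolutional and fully connected layers (the exact LeNet1 tally depends on the details of its layer-to-layer connection scheme, which is not fully shown in the figure):

```python
def conv_params(kh, kw, c_in, c_out):
    # One kh x kw x c_in kernel per output feature map,
    # plus one bias per output map.
    return kh * kw * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output.
    return n_in * n_out + n_out

# For example, a 5x5 convolution taking 3 channels to 16 feature maps:
print(conv_params(5, 5, 3, 16))  # 1216
```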

Figure 5-6. The architecture of LeNet1 (1989)
Figure 5-7. The architecture of AlexNet, with as many as 62.3 million weights (2012); adapted from https://oreil.ly/eEWgJ

Figure 5-8 shows a wonderful illustration of an image of the handwritten digit 8 passing through a pre-trained LeNet1 and ultimately getting correctly classified as 8.

Finally, if the choice of architecture seems arbitrary to you, that is, if you are wondering whether we could accomplish similar performance with a simpler architecture, join the club. The whole community is wondering the same thing.

Figure 5-8. Passing an image of a handwritten 8 through a pre-trained LeNet1 (image source)

Summary and Looking Ahead

In this chapter, we defined the convolution operation: the most significant component of convolutional neural networks. Convolutional neural networks are essential for computer vision, machine audio processing, and other AI applications.

We presented convolution from a systems design perspective, then through filtering one-dimensional and two-dimensional signals. We highlighted the linear algebra equivalent of the convolution operation (multiplying by matrices of special structure), and ended with an example of image classification.

We will encounter convolutional neural networks frequently in this book, as they have become a staple in many AI systems that include vision and/or natural language.

Chapter 6. Singular Value Decomposition: Image Processing, Natural Language Processing, and Social Media

Show me the essential, and only the essential.

H.

The singular value decomposition is a mathematical operation from linear algebra that is widely applicable in the fields of data science, machine learning, and artificial intelligence. It is the mathematics behind principal component analysis (in data analysis) and latent semantic analysis (in natural language processing). This operation transforms a dense matrix into a diagonal matrix. In linear algebra, diagonal matrices are very special and highly desirable. They behave like scalar numbers when we multiply by them, only stretching or squeezing in certain directions.

When computing the singular value decomposition of a matrix, we get the extra bonus of revealing and quantifying the action of the matrix on space itself: rotating, reflecting, stretching, and/or squeezing. There is no warping (bending) of space, since this operation is linear (after all, it is called linear algebra). Extreme stretching or squeezing in one direction versus the others affects the stability of any computations involving our matrix, so having a measure of that allows us direct control over the sensitivity of our computations to various perturbations, for example, noisy measurements.

The power of the singular value decomposition lies in the fact that it can be applied to any matrix. That and its wide use in the field of AI earns it its own chapter in this book. In the following sections, we explore the singular value decomposition, focusing on the big picture rather than the tiny details, and on applications to image processing, natural language processing, and social media.

Given a matrix C (an image, a data matrix, etc.), we omit the details of computing its singular value decomposition. Most linear algebra books do that, presenting a theoretical method based on computing the eigenvectors and eigenvalues of the symmetric matrices $C^t C$ and $C C^t$, which for us are the covariance matrices of the data (if the data is centered). While it is still very important to understand the theory, the method it provides for computing the singular value decomposition is not useful for efficient computations, and is especially impossible for the large matrices involved in many realistic problems. Moreover, we live in an era where software packages help us compute it so easily. In Python, all we have to do is call the numpy.linalg.svd method from the numpy library. We peek briefly into the numerical algorithms that go into these software packages later in this chapter. However, our main focus is on understanding how the singular value decomposition works and why this decomposition is important for reducing the storage and computational requirements of a given problem without losing its essential information. We will also understand the role it plays in clustering data.

Matrix Factorization

We can factorize a scalar number in multiple ways; for instance, we can write the number 12 = 4 × 3, 12 = 2 × 2 × 3, or 12 = 0.5 × 24. Which factorization is better depends on our use case. The same can be done for matrices of numbers. Linear algebra provides us with a variety of useful matrix factorizations. The idea is that we want to break down an object into its smaller components, and these components give us insight about the function and the action of the object itself. This breakdown also gives us a good idea about which components contain the most information, and thus are more important than others. In this case, we might benefit from throwing away the less important components, and building a smaller object with a similar function. The smaller object might not be as detailed as the object that we started with, as that contained all of its components; however, it contains enough significant information from the original object that using it with its smaller size provides benefits. The singular value decomposition is a matrix factorization that does exactly that. Its formula looks like:

$$C_{m \times n} = U_{m \times m} \, \Sigma_{m \times n} \, V^t_{n \times n},$$

where we break down the matrix C into three component matrices: U, Σ , and V t . U and V are square matrices that have orthonormal rows and columns. Σ is a diagonal matrix that has the same shape as C (see Figure 6-1).

Let’s start with matrix multiplication. Suppose A is a matrix with 3 rows and 3 columns:

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}_{3 \times 3}$$

and B is a matrix with 3 rows and 2 columns:

$$B = \begin{pmatrix} 1 & 3 \\ 4 & -2 \\ 0 & 1 \end{pmatrix}_{3 \times 2}$$

Then C = AB is a matrix with 3 rows and 2 columns:

$$C_{3 \times 2} = A_{3 \times 3} B_{3 \times 2} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \begin{pmatrix} 1 & 3 \\ 4 & -2 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 9 & 2 \\ 24 & 8 \\ 39 & 14 \end{pmatrix}$$

We can think of C as being factorized into the product of A and B, in the same way as the number 12 = 4 × 3 . The previous factorization of C has no significance, since neither A nor B is a special type of matrix. A very significant factorization of C is its singular value decomposition. Any matrix has a singular value decomposition. We calculate it using Python (see the associated Jupyter notebook for the code):

$$C_{3 \times 2} = U_{3 \times 3} \Sigma_{3 \times 2} V^t_{2 \times 2} = \begin{pmatrix} -0.1853757 & 0.8938507 & 0.4082482 \\ -0.5120459 & 0.2667251 & -0.8164965 \\ -0.8387161 & -0.3604005 & 0.4082482 \end{pmatrix} \begin{pmatrix} 49.402266 & 0 \\ 0 & 1.189980 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} -0.9446411 & -0.3281052 \\ 0.3281052 & -0.9446411 \end{pmatrix}$$
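This decomposition can be reproduced with numpy.linalg.svd, which returns the singular values as a vector (the signs of the singular vectors may be flipped relative to any particular printed decomposition):

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 3], [4, -2], [0, 1]])
C = A @ B                # [[9, 2], [24, 8], [39, 14]]

U, s, Vt = np.linalg.svd(C)
print(s)                 # [49.402266..., 1.189980...]
print(s[0] / s[-1])      # the condition number, about 41.5

# Reconstruct C: embed s in a 3x2 diagonal matrix first.
S = np.zeros((3, 2))
S[:2, :2] = np.diag(s)
assert np.allclose(U @ S @ Vt, C)
```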

Observe the following in this decomposition: the rows of $V^t$ are the right singular vectors (these are exactly the columns of V), the columns of U are the left singular vectors, and the diagonal entries of $\Sigma$ are the singular values. The singular values are always positive and always arranged in decreasing order along the diagonal of $\Sigma$. The ratio of the largest singular value to the smallest singular value is the condition number $\kappa$ of a matrix. In our case there are only two singular values, and $\kappa = 49.402266 / 1.189980 = 41.515207$. This number plays an important role in the stability of computations involving our matrix. Well-conditioned matrices are those with condition numbers that are not very large.

The left singular vectors are orthonormal (orthogonal to each other and have length 1). Similarly, the right singular vectors are also orthonormal.

For qualitative properties, images are faster to assess than endless arrays of numbers. It is easy to visualize matrices as images using Python (and vice versa, images are stored as matrices of numbers): the value of an entry of the matrix corresponds to the intensity of the corresponding pixel. The higher the number, the brighter the pixel. Smaller numbers in the matrix show up as darker pixels, and larger numbers show up as brighter pixels. Figure 6-1 shows the previously mentioned singular value decomposition. We observe that the diagonal matrix Σ has the same shape as C, and its diagonal entries are arranged in decreasing order, with the brightest pixel, corresponding to the largest singular value, at the top-left corner.

Figure 6-1. Visualizing the singular value decomposition

6-26-3可视化了两个矩形矩阵AB的奇异值分解,其中A是宽的,B是高的:

Figures 6-2 and 6-3 visualize the singular value decompositions of two rectangular matrices A and B, where A is wide and B is tall:

$$A_{3 \times 5} = U_{3 \times 3} \, \Sigma_{3 \times 5} \, V^t_{5 \times 5}$$

$$B_{4 \times 2} = U_{4 \times 4} \, \Sigma_{4 \times 2} \, V^t_{2 \times 2}$$

图 6-2中,我们注意到最后两列 Σ 都是零(黑色像素),因此我们可以节省存储空间并丢弃这两列以及最后两行 V t (请参阅下一节,了解从左侧乘以对角矩阵的情况)。同样,在图 6-3中,我们注意到最后两行 Σ 都是零(黑色像素),因此我们可以节省存储空间并丢弃这两行以及最后两列 U (请参阅下一节,了解如何乘以右侧的对角矩阵)。奇异值分解已经为我们节省了一些空间(请注意,我们通常只存储 Σ 与所有的整个矩阵相反它的零)。

In Figure 6-2, we note that the last two columns of Σ are all zeros (black pixels), hence we can economize in storage and throw away these two columns along with the last two rows of V t (see the next section for multiplying by a diagonal matrix from the left). Similarly, in Figure 6-3, we note that the last two rows of Σ are all zeros (black pixels), hence we can economize in storage and throw away these two rows along with the last two columns of U (see the next section for multiplying by a diagonal matrix from the right). The singular value decomposition is already saving us some space (note that we usually only store the diagonal entries of Σ as opposed to the whole matrix with all its zeros).

Figure 6-2. Visualizing the singular value decomposition of a wide rectangular matrix. The last two columns of Σ are all zeros (black pixels), allowing reduced storage: discard the last two columns of Σ along with the last two rows of V^t
Figure 6-3. Visualizing the singular value decomposition of a tall rectangular matrix. The last two rows of Σ are all zeros (black pixels), allowing reduced storage: discard the last two rows of Σ along with the last two columns of U

Diagonal Matrices

When we multiply a vector by a scalar number, say 3, we obtain a new vector along the same direction with the same orientation, but whose length is stretched three times. When we multiply the same vector by another scalar number, say –0.5, we get another vector, again along the same direction, but this time its length is halved and its orientation is flipped. Multiplying by a scalar number is such a simple operation, and it would be nice if we had matrices that behaved equally easily when we applied them to (in other words, multiplied them by) vectors. If our life was one-dimensional we would only have to deal with scalar numbers, but since our life and applications of interest are higher dimensional, then we have to satisfy ourselves with diagonal matrices (Figure 6-4). These are the good ones.

Figure 6-4. An image of a 5 × 4 diagonal matrix with diagonal entries 10 (the brightest pixel), 6, 3, and 1 (the darkest pixel other than the zeros)

Multiplying by a diagonal matrix corresponds to stretching or squeezing in certain directions in space, with orientation flipping corresponding to any negative numbers on the diagonal. As we very well know, most matrices are very far from being diagonal. The power of the singular value decomposition is that it provides us with the directions in space along which the matrix behaves like (albeit in a broad sense) a diagonal matrix. A diagonal matrix usually stretches/squeezes in the same directions as those for the vector coordinates. If, on the other hand, the matrix is not diagonal, it generally does not stretch/squeeze in the same directions as the coordinates. It does so in other directions, after a change of coordinates. The singular value decomposition gives us the required coordinate change (right singular vectors), the directions along which vectors will be stretched/squeezed (left singular vectors), as well as the magnitude of the stretch/squeeze (singular values). We detail this in the next section, but we first clarify multiplication by diagonal matrices from the left and from the right.

If we multiply a matrix A by a diagonal matrix Σ from the right, A Σ , then we scale the columns of A by the σ ’s, for example:

$$A \Sigma = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{pmatrix} \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix} = \begin{pmatrix} \sigma_1 A_{11} & \sigma_2 A_{12} \\ \sigma_1 A_{21} & \sigma_2 A_{22} \\ \sigma_1 A_{31} & \sigma_2 A_{32} \end{pmatrix}$$

If we multiply A by Σ from the left Σ A , then we scale the rows of A by the σ ’s, for example:

$$\Sigma A = \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{pmatrix} = \begin{pmatrix} \sigma_1 A_{11} & \sigma_1 A_{12} \\ \sigma_2 A_{21} & \sigma_2 A_{22} \\ \sigma_3 A_{31} & \sigma_3 A_{32} \end{pmatrix}$$
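Both scalings are cheap to verify in NumPy; in fact, we never need to materialize the full diagonal matrix at all:

```python
import numpy as np

A = np.arange(1.0, 7.0).reshape(3, 2)   # a 3x2 matrix
sigma_right = np.array([10.0, 0.5])     # scales the two columns
sigma_left = np.array([1.0, 2.0, 3.0])  # scales the three rows

# From the right, a diagonal matrix scales the columns of A;
# from the left, it scales the rows of A.
assert np.allclose(A @ np.diag(sigma_right), A * sigma_right)
assert np.allclose(np.diag(sigma_left) @ A, A * sigma_left[:, None])
```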

Matrices as Linear Transformations Acting on Space

One way we can view matrices is as linear transformations (no warping) that act on vectors in space, and on space itself. If no warping is allowed because it would render an operation nonlinear, then what actions are allowed? The answers are rotation, reflection, stretching, and/or squeezing, which are all nonwarping operations. The singular value decomposition A = U Σ V t captures this concept. When A acts on a vector v , let’s go over the multiplication A v = U Σ V t v step-by-step:

  1. First v gets rotated/reflected because of the orthogonal matrix V t .

  2. Then it gets stretched/squeezed along special directions because of the diagonal matrix Σ .

  3. Finally, it gets rotated/reflected again because of the other orthogonal matrix U.

Reflections and rotations do not really change space, as they preserve size and symmetries (think of rotating an object or looking at its reflection in a mirror). The amount of stretch and/or squeeze encoded in the diagonal matrix Σ (via its singular values on the diagonal) is very informative regarding the action of A.

Orthogonal Matrix

An orthogonal matrix has orthonormal rows and orthonormal columns. It never stretches or squeezes; it only rotates and/or reflects, meaning that it does not change the size and shape of objects when acting on them, only their direction and/or orientation. As with many things in mathematics, these names are confusing: the matrix is called orthogonal even though its rows and columns are orthonormal, which means orthogonal and of length equal to one. One more useful fact: if C is an orthogonal matrix, then $C C^t = C^t C = I$; that is, the inverse of this matrix is its transpose. Computing the inverse of a matrix is usually a very costly operation, but for orthogonal matrices, all we have to do is exchange its rows for its columns.
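A quick numerical check of this fact, using a rotation matrix (one of the orthogonal matrices discussed below) as the example:

```python
import numpy as np

theta = 0.3  # an arbitrary angle
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])  # clockwise rotation

# For an orthogonal matrix, the transpose is the inverse,
# so no costly matrix inversion is needed:
assert np.allclose(R @ R.T, np.eye(2))
assert np.allclose(R.T @ R, np.eye(2))
assert np.allclose(np.linalg.inv(R), R.T)
```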

We illustrate these concepts using two-dimensional matrices since they are easy to visualize. In the following subsections, we explore:

  • The action of a matrix A on the right singular vectors, which are the columns v 1 and v 2 of the matrix V. These get sent to multiples of the left singular vectors u 1 and u 2 , which are the columns of U.

  • The action of A on the standard unit vectors e 1 and e 2 . We also notice that the unit square gets transformed to a parallelogram.

  • The action of A on a general vector x . This will help us understand the matrices U and V as rotations or reflections in space.

  • The action of A on the unit circle. We see that A transforms the unit circle to an ellipse, with its principal axes along the left singular vectors (the u ’s), and the lengths of its principal axes are the singular values (the σ ’s). Since the singular values are ordered from largest to smallest, then u 1 defines the direction with the most variation, and u 2 defines the direction with the second most variation, and so on.

Action of A on the Right Singular Vectors

A 2 × 2 矩阵:

Let A be the 2 × 2 matrix:

$$A = \begin{pmatrix} 1 & 5 \\ -1 & 2 \end{pmatrix}$$

Its singular value decomposition A = U Σ V t is given by:

$$A = \begin{pmatrix} 0.93788501 & 0.34694625 \\ 0.34694625 & -0.93788501 \end{pmatrix} \begin{pmatrix} 5.41565478 & 0 \\ 0 & 1.29254915 \end{pmatrix} \begin{pmatrix} 0.10911677 & 0.99402894 \\ 0.99402894 & -0.10911677 \end{pmatrix}$$

The expression A = U Σ V t is equivalent to:

A V = U Σ

since all we have to do is multiply A = U Σ V t by V from the right and exploit the fact that V t V = I due to the orthogonality of V.

We can think of AV as the matrix A acting on each column of the matrix V. Since A V = U Σ , then the action of A on the orthonormal columns of V is the same as stretching/squeezing the columns of U by the singular values. That is:

$$A v_1 = \sigma_1 u_1$$

and

$$A v_2 = \sigma_2 u_2$$

This is demonstrated in Figure 6-5.

Figure 6-5. The matrix A sends the right singular vectors to multiples of the left singular vectors: $A v_1 = \sigma_1 u_1$ and $A v_2 = \sigma_2 u_2$
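This relation is easy to verify numerically (NumPy may return singular vectors with signs flipped relative to any particular printed decomposition, but the relation holds either way):

```python
import numpy as np

A = np.array([[1.0, 5.0], [-1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

# The right singular vectors (rows of Vt) are sent to the left
# singular vectors (columns of U), scaled by the singular values.
for i in range(2):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])

print(s)  # approximately [5.41565478, 1.29254915]
```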

Action of A on the Standard Unit Vectors and the Unit Square Determined by Them

The matrix A sends the standard unit vectors to its own columns and transforms the unit square into a parallelogram. There is no warping (bending) of space. Figure 6-6 shows this transformation.

Figure 6-6. Transforming the standard unit vectors

Action of A on the Unit Circle

Figure 6-7 shows that the matrix A sends the unit circle to an ellipse. The principal axes lie along the u 's, and the lengths of the principal axes are equal to the σ 's. Again, since matrices represent linear transformations, there is reflection/rotation and stretching/squeezing of space, but no warping.

Figure 6-7. The matrix A sends the unit circle to an ellipse with principal axes along the left singular vectors and principal-axis lengths equal to the singular values

We can easily see the described action from the singular value decomposition.

The polar decomposition:

$$A = QS,$$

where Q is an orthogonal matrix and S is a symmetric positive semidefinite matrix (in terms of the singular value decomposition, $Q = UV^t$ and $S = V \Sigma V^t$), is a very easy way to show geometrically how a circle gets transformed into an ellipse.

Breaking Down the Circle-to-Ellipse Transformation According to the Singular Value Decomposition

Figure 6-8 shows four subplots that break down the steps of the circle-to-ellipse transformation illustrated previously:

  1. First we multiply the unit circle and the vectors v 1 and v 2 by V t . Since V t V = I , we have V t v 1 = e 1 and V t v 2 = e 2 . So, in the beginning, the right singular vectors get straightened out, aligning correctly with the standard unit vectors.

  2. Then we multiply by Σ . All that happens here is stretching/squeezing the standard unit vectors by σ 1 and σ 2 (the stretch or squeeze depends on whether the magnitude of the singular value is greater or smaller than one).

  3. Finally we multiply by U. This either reflects the ellipse across a line or rotates it a certain amount clockwise or counterclockwise. The next subsection explains this in detail.

Figure 6-8. The steps of the unit-circle-to-ellipse transformation using the singular value decomposition

Rotation and Reflection Matrices

The matrices U and V t that appear in the singular value decomposition A = U Σ V t are orthogonal matrices. Their rows and columns are orthonormal, and their inverse is the same as their transpose. In two dimensions, the U and V could either be rotation or reflection (about a line) matrices.

Rotation matrix

A matrix that rotates clockwise by an angle θ is given by:

$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$

The transpose of a rotation matrix is a rotation in the opposite direction. So if a matrix rotates clockwise by an angle θ , then its transpose rotates counterclockwise by θ and is given by:

因斯 θ - θ θ 因斯 θ

Reflection matrix

A reflection matrix about a line L making an angle θ with the x -axis is:

$$\begin{pmatrix} \cos 2\theta & \sin 2\theta \\ \sin 2\theta & -\cos 2\theta \end{pmatrix}$$

The slope of the straight line L is tan θ and it passes through the origin, so its equation is y = ( tan θ ) x . This line acts like a mirror for the reflection operation. Figure 6-9 shows the two straight lines about which the matrices V t and U reflect, together with a vector x and its subsequent transformation.

The determinant of a rotation matrix is 1, and the determinant of a reflection matrix is -1.
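Both determinants, and the mirror property of the reflection, can be checked numerically (a sketch with an arbitrary angle):

```python
import numpy as np

theta = np.pi / 6  # the mirror line is y = tan(theta) x
F = np.array([[np.cos(2 * theta), np.sin(2 * theta)],
              [np.sin(2 * theta), -np.cos(2 * theta)]])  # reflection
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])          # rotation

assert round(np.linalg.det(R)) == 1   # rotation: determinant 1
assert round(np.linalg.det(F)) == -1  # reflection: determinant -1

# A point on the mirror line is left unchanged by the reflection:
p = np.array([np.cos(theta), np.sin(theta)])
assert np.allclose(F @ p, p)
```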

In higher dimensions, reflection and rotation matrices look different. Always make sure you understand the object you are dealing with. If we have a rotation in a three-dimensional space, then about what axis? If we have a reflection, then about what plane? If you want to dive deeper, this is a good time to read about orthogonal matrices and their properties.

Action of A on a General Vector x

We have explored the action of A on the right singular vectors (they get mapped to the left singular vectors), the standard unit vectors (they get mapped to the columns of A), the unit square (it gets mapped to a parallelogram), and the unit circle (it gets mapped to an ellipse with principal axes along the left singular vectors and whose lengths are equal to the singular values). Finally, we explore the action of A on a general, nonspecial, vector x . This gets mapped to another nonspecial vector A x . However, breaking down this transformation into steps using the singular value decomposition is informative.

Recall our matrix A and its singular value decomposition:

A = [1  5; −1  2] = U Σ V^t = [0.93788501  0.34694625; 0.34694625  −0.93788501] [5.41565478  0; 0  1.29254915] [0.10911677  0.99402894; 0.99402894  −0.10911677]

Both U and V^t in this singular value decomposition happen to be reflection matrices. The straight lines L_U and L_{V^t} that act as mirrors for these reflections are plotted in Figure 6-9, and their equations are easy to find from their respective matrices: cos(2θ) and sin(2θ) are on the first row, so we can use those to find the slope tan(θ). The equation of the line along which V^t reflects is then y = (tan θ_{V^t}) x = 0.8962347008436108 x, and that of the line along which U reflects is y = (tan θ_U) x = 0.17903345403184898 x. Since A x = U Σ V^t x, first x gets reflected across the line L_{V^t}, arriving at V^t x. Then, when we multiply by Σ from the left, the first coordinate of V^t x gets stretched horizontally by the first singular value, and the second coordinate gets stretched by the second singular value, obtaining Σ V^t x. Finally, when we multiply by U, the vector Σ V^t x gets reflected across the line L_U, arriving at A x = U Σ V^t x. Figure 6-9 illustrates this process.
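We can check this decomposition numerically. The sketch below (assuming NumPy; the sign conventions of np.linalg.svd may differ from the factors printed above, since singular vectors are only determined up to sign) computes the SVD of A and verifies that the factors multiply back to A:

```python
import numpy as np

# The matrix from the text.
A = np.array([[1.0, 5.0],
              [-1.0, 2.0]])

# NumPy returns the singular values as a vector, in decreasing order.
U, S, Vt = np.linalg.svd(A)
print(S)  # approximately [5.41565478, 1.29254915]

# The three factors multiply back to A.
print(np.allclose(U @ np.diag(S) @ Vt, A))

# |det A| equals the product of the singular values.
print(np.isclose(abs(np.linalg.det(A)), S.prod()))
```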

Figure 6-9. The action of matrix A on a general vector x. The transformation is carried out step by step using the singular value decomposition

Three Ways to Multiply Matrices

Efficient algorithms for matrix multiplication are highly desirable in the age of big data. In theory, there are three ways to multiply two matrices A_{m×n} and B_{n×s}:

Row-column approach

Produce one entry (ab) ij at a time by taking the dot product of the ith row of A with the jth column of B:

(ab)_{ij} = A_{row i} · B_{col j} = Σ_{k=1}^{n} a_{ik} b_{kj}
Column-column approach

Produce one column (AB) col i at a time by linearly combining the columns of A using the entries of the ith column of B :

(AB)_{col i} = b_{1i} A_{col 1} + b_{2i} A_{col 2} + ⋯ + b_{ni} A_{col n}
Column-row approach

Produce rank one pieces of the product, one piece at a time, by multiplying the first column of A with the first row of B, the second column of A with the second row of B, and so on. Then add all these rank one matrices together to get the final product AB:

A B = A_{col 1} B_{row 1} + A_{col 2} B_{row 2} + ⋯ + A_{col n} B_{row n}
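The three approaches can be checked against each other in a few lines. This is a sketch assuming NumPy; the small random matrices here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # A is 3x4
B = rng.standard_normal((4, 2))  # B is 4x2

# 1. Row-column: one entry at a time, dot product of row i with column j.
C1 = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
               for i in range(A.shape[0])])

# 2. Column-column: each column of AB linearly combines the columns of A.
C2 = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

# 3. Column-row: a sum of rank one outer products.
C3 = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(np.allclose(C1, A @ B), np.allclose(C2, A @ B), np.allclose(C3, A @ B))
```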

How does this help us understand the usefulness of the singular value decomposition? We can expand the product A = U Σ V t of the singular value decomposition as a sum of rank one matrices, using the column-row approach for matrix multiplication. Here, we multiply the matrix U Σ (which scales each column U col i of U by σ i ) with V t :

A = U Σ V^t = σ_1 U_{col 1} V^t_{row 1} + σ_2 U_{col 2} V^t_{row 2} + ⋯ + σ_r U_{col r} V^t_{row r}

where r is the number of nonzero singular values of A (also called the rank of A).

The great thing about this expression is that it splits A into a sum of rank one matrices arranged according to their order of importance, since the σ’s are arranged in decreasing order. Moreover, it provides a straightforward way to approximate A by lower rank matrices: throw away the lower singular values. The Eckart–Young–Mirsky theorem asserts that this is in fact the best way to find a low-rank approximation of A, when the closeness of the approximation is measured using the Frobenius norm for matrices (which is the square root of the sum of the squares of the singular values). Later in this chapter, we take advantage of this rank one decomposition of A for digital image compression.
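The Eckart–Young–Mirsky statement is easy to verify numerically: the Frobenius error of a truncated SVD equals the square root of the sum of the squares of the discarded singular values. A sketch, assuming NumPy and a small random matrix:

```python
import numpy as np

# A small random matrix stands in for data; any shape works.
rng = np.random.default_rng(42)
A = rng.standard_normal((6, 5))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Eckart-Young-Mirsky: the Frobenius error of this truncation equals
# the square root of the sum of the squares of the discarded sigmas.
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(S[k:] ** 2)))
```

The two printed numbers agree to machine precision.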

Algorithms for Matrix Multiplication

Finding efficient algorithms for matrix multiplication is an essential, yet surprisingly difficult, goal. In matrix multiplication algorithms, saving even one multiplication operation is worthwhile (saving on addition is not as big a deal). Recently, DeepMind developed AlphaTensor (2022) to automatically discover more efficient algorithms for matrix multiplication. This is a milestone because matrix multiplication is a fundamental part of a vast array of technologies, including neural networks, computer graphics, and scientific computing.

The Big Picture

So far we have focused on the singular value decomposition of a matrix A = U Σ V^t in terms of A’s action on space and in terms of approximating A using lower rank matrices. Before moving to applications relevant to AI, let’s take an eagle-eye perspective and address the big picture.

Given a matrix of real numbers, we want to understand the following, depending on our use case:

  • If the matrix represents data that we care about, like images or tabular data, what are the most important components of this matrix (data)?

  • Along what important directions is the data mostly spread (directions with most variation in the data)?

  • If I think of a matrix A_{m×n} as a transformation from the initial space ℝ^n to the target space ℝ^m, what is the effect of this matrix on vectors in ℝ^n? To which vectors in ℝ^m do they get sent?

  • What is the effect of this matrix on space itself? Since this is a linear transformation, we know there is no space warping, but there is space stretching, squeezing, rotating, and reflecting.

  • Many physical systems can be represented as a system of linear equations A x = b . How can we solve this system (find x )? What is the most efficient way to go about this, depending on the properties of A? If there is no solution, is there an approximate solution that satisfies our purposes? Note that here we are looking for the unknown vector x that gets transformed to b when A acts on it.

The singular value decomposition can be used to answer all these questions. The first two are intrinsic to the matrix itself, while the next two have to do with the effect of multiplying the matrix with vectors (the matrix acts on space and the vectors in this space). The last question has to do with the very important problem of solving systems of linear equations and appears in all kinds of applications.

Therefore, we can investigate a matrix of numbers in two ways:

  • What are its intrinsic properties?

  • What are its properties when viewed as a transformation?

These two are related because the matrix’s intrinsic properties affect how it acts on vectors and on space.

The following are properties to keep in mind:

  • A sends the orthonormal vectors v i (right singular vectors) of its initial space to scalar multiples of the orthonormal vectors u i (left singular vectors) of its target space:

    A v_i = σ_i u_i
  • If our matrix is square, then the absolute value of its determinant is equal to the product of all its singular values: σ 1 σ 2 σ r .

  • The condition number of the matrix, with respect to the l 2 norm, which is the usual distance in Euclidean space, is the ratio of the largest singular value to the smallest singular value:

    κ = σ_1 / σ_r

The Condition Number and Computational Stability

The condition number is very important for computational stability:

  • The condition number measures how much A stretches space. If the condition number is too large, then it stretches space too much in one direction relative to another direction, and it could be dangerous to do computations in such an extremely stretched space. Solving A x = b when A has a large condition number makes the solution x unstable in the sense that it is extremely sensitive to perturbations in b . A small error in b will result in a solution x that is wildly different from the solution without the error in b . It is easy to envision this instability geometrically.

  • Numerically solving A x = b (say, by Gaussian elimination or by iterative methods) works fine when the involved matrices have reasonable (not very large) condition numbers.

  • One thing about a matrix with an especially large condition number: it stretches space so much that it almost collapses into a space of lower dimension. The interesting part is that if we decide to throw away that very small singular value and hence work in the collapsed space of lower dimension, our computations become perfectly fine. So at the boundaries of extremeness lies normalcy, except that this normalcy now lies in a lower dimension.

  • Many iterative numerical methods, including the very useful gradient descent, have matrices involved in their analysis. If the condition number of these matrices is too large, then the iterative method might not converge to a solution. The condition number controls how fast these iterative methods converge.
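A small sketch makes this sensitivity concrete. The matrix below is nearly singular (its rows are almost linearly dependent), so its condition number is large, and a tiny perturbation of b flips the solution entirely; the particular matrix and perturbation are illustrative choices, assuming NumPy:

```python
import numpy as np

# A nearly singular matrix: the second row is almost twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0001]])

# Condition number = ratio of the extreme singular values.
S = np.linalg.svd(A, compute_uv=False)
kappa = S[0] / S[-1]
print(kappa, np.linalg.cond(A, 2))  # the two agree

# A tiny perturbation of b produces a wildly different solution x.
b = np.array([1.0, 2.0])
x1 = np.linalg.solve(A, b)                           # [1, 0]
x2 = np.linalg.solve(A, b + np.array([0.0, 1e-4]))   # [-1, 1]
print(x1, x2)
```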

The Ingredients of the Singular Value Decomposition

In this chapter we have been dissecting only one formula: A = U Σ V t . We used Python to compute the entries of U, Σ , and V, but what exactly are these entries? The answer is short, if we happen to know what eigenvectors and eigenvalues are, which we clarify in the next section. For now, we list the ingredients of U, Σ , and V:

  • The columns of V (the right singular vectors) are the orthonormal eigenvectors of the symmetric matrix A t A .

  • The columns of U (the left singular vectors) are the orthonormal eigenvectors of the symmetric matrix A A t .

  • The singular values σ_1, σ_2, …, σ_r are the square roots of the eigenvalues of A^t A or A A^t. The singular values are nonnegative and arranged in decreasing order. The singular values can be zero.

  • A v_i = σ_i u_i.

Every real symmetric positive semi-definite (nonnegative eigenvalues) matrix is diagonalizable, S = P D P^{-1}, which means that it is similar to a diagonal matrix D when viewed in a different set of coordinates (the columns of P). A^t A and A A^t both happen to be symmetric positive semi-definite, so they are diagonalizable.

Singular Value Decomposition Versus the Eigenvalue Decomposition

It is important to learn more about symmetric matrices if we want to understand the ingredients of the singular value decomposition. This will also help us discern the difference between the singular value decomposition A = U Σ V t and the eigenvalue decomposition A = P D P -1 when the latter exists.

The singular value decomposition (SVD) always exists, but the eigenvalue decomposition exists only for special matrices, called diagonalizable. Rectangular matrices are never diagonalizable. Square matrices may or may not be diagonalizable. When the square matrix is diagonalizable, the SVD and the eigenvalue decomposition are not equal, unless the matrix is symmetric and has nonnegative eigenvalues.

We can think of a hierarchy in terms of the desirability of matrices:

  1. The best and easiest matrices are square diagonal matrices with the same number along the diagonal.

  2. The second best ones are square diagonal matrices D that don’t necessarily have the same numbers along the diagonal.

  3. The third best matrices are symmetric matrices. These have real eigenvalues and orthogonal eigenvectors. They are the next closest type of matrices to diagonal matrices, in the sense that they are diagonalizable, S = P D P^{-1}, or similar to a diagonal matrix after a change of basis. The columns of P (eigenvectors) are orthogonal.

  4. The fourth best matrices are square matrices that are diagonalizable, A = P D P^{-1}. These are similar to a diagonal matrix after a change of basis; however, the columns of P (eigenvectors) need not be orthogonal.

  5. The fifth best matrices are all the rest. These are not diagonalizable, meaning there is no change of basis that can turn them diagonal; however, there is the next closest approach to making them similar to a diagonal matrix via the singular value decomposition A = U Σ V^t. Here U and V are different from each other, and they have orthonormal columns and rows. Their inverse is very easy, since it is the same as their transpose. The singular value decomposition works for both square and nonsquare matrices.

Given a matrix A , both A t A and A A t happen to be symmetric and positive semi-definite (meaning their eigenvalues are nonnegative); thus, they are diagonalizable with two bases of orthogonal eigenvectors. When we divide by the norm of these orthogonal eigenvectors, they become orthonormal. These are the columns of V and of U , respectively.

A^t A and A A^t have exactly the same nonnegative eigenvalues, λ_i = σ_i². Arrange the square roots of these in decreasing order (keeping the corresponding eigenvector order in U and V), and we get the diagonal matrix Σ in the singular value decomposition.

What if the matrix we start with is symmetric? How is its singular value decomposition A = U Σ V^t related to its diagonalization A = P D P^{-1}? The columns of P, which are the eigenvectors of symmetric A, are orthogonal. When we divide by their lengths, they become orthonormal. Stack these orthonormal eigenvectors in a matrix in the order corresponding to the decreasing absolute value of the eigenvalues and we get both the U and the V for the singular value decomposition. Now if all the eigenvalues of symmetric A happen to be nonnegative, the singular value decomposition of this positive semi-definite symmetric matrix will be the same as its eigenvalue decomposition, provided you normalize the orthogonal eigenvectors in P, ordering them with respect to the nonnegative eigenvalues in decreasing order. So U = V in this case. What if some (or all) of the eigenvalues are negative? Then σ_i = |λ_i| = −λ_i, but now we have to be careful with the corresponding eigenvectors: A v_i = λ_i v_i = (−λ_i)(−v_i) = σ_i u_i, so u_i = −v_i. This makes U and V in the singular value decomposition unequal. So the singular value decomposition of a symmetric matrix that has some negative eigenvalues can be easily extracted from its eigenvalue decomposition, but it is not exactly the same.

What if the matrix we start with is not symmetric but diagonalizable? How is its singular value decomposition A = U Σ V t related to its diagonalization A = P D P -1 ? In this case, the eigenvectors of A , which are the columns of P , are in general not orthogonal, so the singular value decomposition and the eigenvalue decomposition of such a matrix are not related.

Computation of the Singular Value Decomposition

How do Python and other numerical tools calculate the singular value decomposition of a matrix? What numerical algorithms lie under the hood? The fast answer is: QR decomposition, Householder reflections, and iterative algorithms for eigenvalues and eigenvectors.

In theory, calculating the singular value decomposition for a general matrix, or the eigenvalues and the eigenvectors for a square matrix, requires setting a polynomial equal to 0 to solve for the eigenvalues, then setting up a linear system of equations to solve for the eigenvectors. This is far from being practical for applications. The problem of finding the zeros of a polynomial is very sensitive to any variations in the coefficients of the polynomial, so the computational problem becomes prone to roundoff errors that are present in the coefficients. We need stable numerical methods that find the eigenvectors and eigenvalues without having to numerically compute the zeros of a polynomial. Moreover, we need to make sure that the matrices involved in linear systems of equations are well conditioned, otherwise popular methods like Gaussian elimination (the LU decomposition) do not work.

Most numerical implementations of the singular value decomposition try to avoid computing A A t and A t A . This is consistent with one of the themes of this book: avoid multiplying matrices; instead, multiply a matrix with vectors. The popular numerical method for the singular value decomposition uses an algorithm called Householder reflections to transform the matrix to a bidiagonal matrix (sometimes preceded by a QR decomposition), then uses iterative algorithms to find the eigenvalues and eigenvectors. The field of numerical linear algebra develops such methods and adapts them to the types and sizes of matrices that appear in applications. In the next subsection, we present an iterative method to compute one eigenvalue and its corresponding eigenvector for a given matrix.

Computing an Eigenvector Numerically

An eigenvector of a square matrix A is a nonzero vector that does not change its direction when multiplied by A ; instead, it only gets scaled by an eigenvalue λ :

A v = λ v

The following iterative algorithm is an easy numerical method that finds an eigenvector of a matrix corresponding to its largest eigenvalue:

  1. Start at a random unit vector (of length 1) v_0.

  2. Multiply by A: v_{i+1} = A v_i.

  3. Divide by the length of v_{i+1} to avoid the size of our vectors growing too large.

  4. Stop when you converge.
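The steps above can be sketched in a few lines, assuming NumPy (the matrix is the example worked out next in the text, and the Rayleigh quotient v·Av is used to read off the eigenvalue at the end):

```python
import numpy as np

def power_iteration(A, num_iters=50):
    """Repeatedly multiply by A and normalize to length 1.
    Converges toward an eigenvector of the largest-magnitude eigenvalue."""
    v = np.array([1.0, 0.0])  # starting unit vector
    for _ in range(num_iters):
        v = A @ v
        v = v / np.linalg.norm(v)  # keep the vectors from growing
    lam = v @ (A @ v)  # Rayleigh quotient recovers the eigenvalue
    return v, lam

A = np.array([[1.0, 2.0],
              [2.0, -3.0]])
v, lam = power_iteration(A)
print(v, lam)  # lam is approximately -3.8284271 = -1 - 2*sqrt(2)
```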

This iterative method is very simple but has a drawback: it only finds one eigenvector of the matrix—the eigenvector corresponding to its largest eigenvalue. So it finds the direction that gets stretched the most when we apply A .

For example, consider the matrix A = [1  2; 2  −3]. We start with the vector v_0 = [1, 0]^t and apply the above algorithm, which converges after 28 iterations to the vector v = [−0.38268343, 0.92387953]^t. The code is in the linked Jupyter notebook and the output is shown here:

[1, 0]
[0.4472136  0.89442719]
[ 0.78086881 -0.62469505]
[-0.1351132   0.99083017]
[ 0.49483862 -0.86898489]
[-0.3266748  0.9451368]
[ 0.40898444 -0.91254136]
[-0.37000749  0.92902877]
[ 0.38871252 -0.92135909]
[-0.37979817  0.92506937]
[ 0.3840601 -0.9233081]
[-0.38202565  0.92415172]
[ 0.38299752 -0.92374937]
[-0.38253341  0.92394166]
[ 0.38275508 -0.92384985]
[-0.38264921  0.92389371]
[ 0.38269977 -0.92387276]
[-0.38267563  0.92388277]
[ 0.38268716 -0.92387799]
[-0.38268165  0.92388027]
[ 0.38268428 -0.92387918]
[-0.38268303  0.9238797 ]
[ 0.38268363 -0.92387945]
[-0.38268334  0.92387957]
[ 0.38268348 -0.92387951]
[-0.38268341  0.92387954]
[ 0.38268344 -0.92387953]
[-0.38268343  0.92387953]
[ 0.38268343 -0.92387953]

 v= [-0.38268343  0.92387953]
Av= [ 1.46507563 -3.53700546]
λ = -3.828427140993716

Figure 6-10 shows this iteration. Note that all the vectors have length 1 and that the direction of the vector does not change when the algorithm converges, hence capturing an eigenvector of A. For the last few iterations, the sign keeps oscillating, so the vector keeps flipping orientation, and the eigenvalue must be negative. Indeed, we find it to be λ = −3.828427140993716.

Figure 6-10. We start at v_0 = [1, 0]^t, then multiply by A and normalize until convergence to an eigenvector

The Pseudoinverse

Many physical systems can be represented by (or approximated by) a linear system of equations A x = b . If x is an unknown vector that we care for, then we need to divide by matrix A in order to find x . The matrix equivalent of division is finding the inverse A -1 , so that the solution x = A -1 b . Matrices that have an inverse are called invertible. These are square matrices with a nonzero determinant (the determinant is the product of the eigenvalues; the product of the singular values and the determinant will have the same absolute value). But what about all the systems whose matrices are rectangular? How about those with noninvertible matrices? And those whose matrices are square and invertible, but are almost noninvertible (their determinant is very close to zero)? We still care about finding solutions to such systems. The power of the singular value decomposition is that it exists for any matrix, including those mentioned above, and it can help us invert any matrix.

Given any matrix and its singular value decomposition A = U Σ V t , we can define its pseudoinverse as:

A^+ = V Σ^+ U^t

where Σ^+ is obtained from Σ by inverting all its diagonal entries except for the ones that are zero (or very close to zero if the matrix happens to be ill-conditioned).

This allows us to find solutions to any system of linear equations A x = b, namely x = A^+ b.

The pseudoinverse of a matrix coincides with its inverse when the latter exists.
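As a sketch (assuming NumPy), here is the pseudoinverse built directly from the SVD and used to solve an overdetermined system in the least-squares sense; the simple 1/S inversion below assumes none of the singular values is zero:

```python
import numpy as np

# An overdetermined system: three equations, two unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

# Pseudoinverse from the SVD: transpose the factors and invert the
# nonzero singular values (this sketch assumes all sigmas are nonzero).
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / S) @ U.T

x = A_plus @ b  # least-squares solution, here [1/3, 1/3]
print(x)
print(np.allclose(A_plus, np.linalg.pinv(A)))  # matches NumPy's pinv
```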

Applying the Singular Value Decomposition to Images

We are finally ready for real-world applications of the singular value decomposition. We start with image compression. Digital images are stored as matrices of numbers, where each number corresponds to the intensity of a pixel. We will use the singular value decomposition to reduce the storage requirements of an image without losing its most essential information. All we have to do is throw away the insignificant singular values, along with the corresponding columns of U and rows of V t . The mathematical expression that helps us here is:

A = U Σ V^t = σ_1 U_{col 1} V^t_{row 1} + σ_2 U_{col 2} V^t_{row 2} + ⋯ + σ_r U_{col r} V^t_{row r}

Recall that the σ ’s are arranged from the largest value to the smallest value, so the idea is that we can keep the first few large σ ’s and throw away the rest of the σ ’s, which are small anyway.

Let’s work with the image in Figure 6-11. The code and details are in the book’s GitHub page. Each color image has three channels: red, green, and blue (see Figures 6-12 and 6-13). Each channel is a matrix of numbers, just like the ones we have been working with in this chapter.

Each channel of the image in Figure 6-11 is a 960 × 714 matrix, so to store the full image, we need 960 × 714 × 3 = 2,056,320 numbers. Imagine the storage requirements for a streaming video, which contains many image frames. We need a compression mechanism, so as not to run out of memory. We compute the singular value decomposition for each channel (see Figure 6-14 for an image representation of the singular value decomposition for the red channel). We then perform a massive reduction, retaining for each channel only the first 25 singular values (out of 714), 25 columns of U (out of 960), and 25 rows of V^t (out of 714). The storage reduction for each channel is substantial: U is now 960 × 25, V^t is 25 × 714, and we only need to store 25 singular values (no need to store the zeros of the diagonal matrix Σ). This adds up to 41,875 numbers for each channel, so for all 3 channels, we need to store 41,875 × 3 = 125,625 numbers, a whopping 93% storage requirement reduction.

We put the image back together, one channel at a time, by multiplying the reduced U, reduced Σ , and reduced V t together:

channel_reduced = U_{960×25} Σ_{25×25} V^t_{25×714}

Figure 6-15 shows the result of this multiplication for each of the red, green, and blue channels.

Finally, we layer the reduced channels to produce the reduced image (Figure 6-16). It is obvious that we lost a lot of detail in the process, but that is a trade-off we have to live with.
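The channel-by-channel compression can be sketched as follows; since the actual photo lives in the book's GitHub repository, a random matrix stands in here for one 960 × 714 channel (the storage arithmetic matches the counts above):

```python
import numpy as np

# A random matrix stands in for one 960 x 714 color channel; the real
# image is on the book's GitHub page.
rng = np.random.default_rng(0)
channel = rng.random((960, 714))

k = 25  # number of singular values to keep
U, S, Vt = np.linalg.svd(channel, full_matrices=False)
reduced = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # same shape, rank 25

# Storage per channel: k columns of U, k sigmas, and k rows of Vt.
stored = 960 * k + k + k * 714  # 41,875 numbers
original = 960 * 714            # 685,440 numbers
print(stored, original, 1 - stored / original)
```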

Figure 6-11. A digital color image of size 960 × 714 × 3
Figure 6-12. The red, green, and blue channels of the digital image. Each has size 960 × 714
Figure 6-13. The three channels of the digital image displayed in red, green, and blue hues. Each has size 960 × 714 × 3. See the color image on GitHub
Figure 6-14. The singular value decomposition of the red channel. We have 714 nonzero singular values, but only a few are significant. Even though the diagonal matrix Σ looks all black, it has nonzero singular values along its diagonal; the pixels are not bright enough to show at this resolution level
Figure 6-15. The reduced red, green, and blue channels. For each channel, we keep only the first 25 singular values, the first 25 columns of U, and the first 25 rows of V^t
Figure 6-16. The original image with 714 singular values versus the reduced image with only 25 singular values. Both still have size 960 × 714 × 3 but require different amounts of storage

For advanced image compression techniques, check this article.

Principal Component Analysis and Dimension Reduction

Principal component analysis is widely popular for data analysis. It is used for dimension reduction and clustering in unsupervised machine learning. In a nutshell, it is the singular value decomposition performed on the data matrix X, after centering the data, which means subtracting the average value of each feature from each feature column (each column of X). The principal components are then the right singular vectors, which are the rows of V t in the now familiar decomposition X = U Σ V t .

Statisticians like to describe the principal component analysis using the language of variance, or variation in the data, and uncorrelating the data. They end up working with the eigenvectors of the covariance matrix of the data. This is a familiar description of principal component analysis in statistics:

It is a method that reduces the dimensionality of a dataset, while preserving as much variability, or statistical information, as possible. Preserving as much variability as possible translates into finding new features that are linear combinations of those of the dataset, that successively maximize variance and that are uncorrelated with each other.

The two descriptions (the right singular vectors of the centered data, and the eigenvectors of the covariance matrix) are exactly the same, since the rows of V t are the eigenvectors of X centered t X centered , which in turn is the covariance matrix of the data. Moreover, the term uncorrelating in statistics corresponds to diagonalizing in mathematics and linear algebra, and the singular value decomposition says that any matrix acts like a diagonal matrix, namely Σ , when written in a new set of coordinates, namely the rows of V t , which are the columns of V.
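
This equivalence is easy to verify numerically. The following sketch (illustrative, with a small random matrix) checks that the right singular vectors of the centered data match the eigenvectors of X_centered^t X_centered up to sign, and that the eigenvalues are the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
Xc = X - X.mean(axis=0)

# Right singular vectors of the centered data...
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# ...versus eigenvectors of Xc^t Xc (proportional to the covariance matrix).
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
# eigh returns ascending eigenvalues; flip to match descending singular values.
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The eigenvalues are the squared singular values.
assert np.allclose(eigvals, S**2)

# Each eigenvector matches the corresponding row of Vt, up to sign.
for i in range(4):
    same = np.allclose(eigvecs[:, i], Vt[i])
    flipped = np.allclose(eigvecs[:, i], -Vt[i])
    assert same or flipped
```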

让我们详细解释一下。假设 X 是中心化的数据矩阵,其奇异值分解为 X = UΣV^t。这与 XV = UΣ 相同,或者,当我们按列展开该表达式时,X V_col_i = σ_i U_col_i。注意 X V_col_i 只是以 V 的该特定列的条目为系数、对数据特征做的线性组合。现在,与本章中一贯的做法一致,我们可以丢弃不太重要的成分,即 V 和 U 中对应于较小奇异值的列。

Let’s explain this in detail. Suppose X is a centered data matrix and its singular value decomposition is X = U Σ V t . This is the same as X V = U Σ , or, when we resolve the expression by column, X V col i = σ i U col i . Note that X V col i is just a linear combination of the features of the data using the entries of that particular column of V. Now, faithful to what we have been doing all along in this chapter, we can throw away the less significant components, meaning the columns of V and U corresponding to the lower singular values.

现在假设我们的数据有 200 个特征,但只有 2 个奇异值是重要的,因此我们决定仅保留 V 的前 2 列和 U 的前 2 列。这样,我们就将特征的维度从 200 降到了 2。第一个新特征是以 V 的第一列的条目为系数、对全部 200 个原始特征做的线性组合,而这恰好是 σ_1 U_col_1;第二个新特征是以 V 的第二列的条目为系数、对全部 200 个原始特征做的线性组合,而这恰好是 σ_2 U_col_2。

Suppose now that our data has 200 features, but only 2 singular values are significant, so we decide to only keep the first 2 columns of V and the first 2 columns of U. Thus, we have reduced the dimension of the features from 200 to 2. The first new feature is a linear combination of all the original 200 features using the entries of the first column of V, but that is exactly σ 1 U col 1 , and the second new feature is a linear combination of all the original 200 features using the entries of the second column of V, but that is exactly σ 2 U col 2 .

现在让我们考虑单个数据点。数据矩阵 X 中的一个数据点有 200 个特征,这意味着我们需要 200 个轴来绘制该数据点。然而,采用我们之前仅使用前两个主成分执行的降维,该数据点现在将只有两个坐标,即 σ_1 U_col_1 和 σ_2 U_col_2 中对应的条目。因此,如果这是数据集中的第三个数据点,那么它的新坐标将是 σ_1 U_col_1 的第三个条目和 σ_2 U_col_2 的第三个条目。现在可以很容易地在二维空间中绘制该数据点,而不必在原始的 200 维空间中绘制它。

Now let’s think of individual data points. A data point in the data matrix X has 200 features. This means that we need 200 axes to plot this data point. However, taking the dimension reduction we performed previously using only the first two principal components, this data point will now have only two coordinates, which are the corresponding entries of σ 1 U col 1 and σ 2 U col 2 . So if this was the third data point in the data set, then its new coordinates will be the third entry of σ 1 U col 1 and the third entry of σ 2 U col 2 . Now it is easy to plot this data point in a two-dimensional space, as opposed to plotting it in the original 200-dimensional space.
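
A hypothetical sketch of exactly this reduction, with 50 data points and 200 features (all names and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# 50 data points with 200 features, centered below.
X = rng.normal(size=(50, 200))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first two principal components: the new coordinates of the
# data points are the entries of sigma_1 * U_col1 and sigma_2 * U_col2.
coords = U[:, :2] * S[:2]          # shape (50, 2)

# Equivalently, X V_col_i = sigma_i U_col_i: projecting X onto the first
# two columns of V gives the same coordinates.
assert np.allclose(coords, X @ Vt[:2].T)

# The third data point (row index 2) now has just two coordinates.
point3 = coords[2]
```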

我们可以选择保留多少个奇异值(以及主成分)。保留得越多,就越忠实于原始数据集,但维度当然也会更高。这种截断决策(为奇异值截断寻找最佳阈值)是当前研究的主题。常见的方法是提前确定所需的秩,或者保留原始数据中一定比例的方差。其他技术则绘制所有奇异值,观察图中的明显拐点,并决定在该位置截断,希望借此将数据中的基本模式与噪声分开。

We choose how many singular values (and thus principal components) to retain. The more we keep, the more faithful to the original data set we would be, but of course the dimension will be higher. This truncation decision (finding the optimal threshold for singular value truncation) is the subject of ongoing research. The common method is determining the desired rank ahead of time, or keeping a certain amount of variance in the original data. Other techniques plot all the singular values, observe an obvious change in the graph, and decide to truncate at that location, hopefully separating the essential patterns in the data from the noise.
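
One common heuristic from the paragraph above, keeping a certain amount of variance, can be sketched as follows (the 99% threshold and the synthetic low-rank-plus-noise data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Low-rank signal plus small noise: most variance lives in a few components.
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 40))
X = signal + 0.01 * rng.normal(size=(100, 40))
X = X - X.mean(axis=0)

S = np.linalg.svd(X, compute_uv=False)

# Fraction of the total variance captured by the first k singular values.
energy = np.cumsum(S**2) / np.sum(S**2)

# Smallest rank that retains at least 99% of the variance.
k = int(np.searchsorted(energy, 0.99) + 1)
assert k <= 10  # the 3 signal directions dominate
```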

重要的是,不仅要对数据进行中心化,还要对其进行标准化:减去每个特征的平均值并除以标准差。原因是奇异值分解对特征测量的尺度很敏感。当我们标准化数据时,我们最终使用的是相关矩阵而不是协方差矩阵。为了避免混淆,要记住的要点是:我们对标准化后的数据集进行奇异值分解,主成分是 V 的列,数据点的新坐标是 σ_i U_col_i 的条目。

It is important not to only center the data, but to also standardize it: subtract the mean of each feature and divide by the standard deviation. The reason is that the singular value decomposition is sensitive to the scale of the feature measurements. When we standardize the data, we end up working with the correlation matrix instead of the covariance matrix. To not confuse ourselves, the main point to keep in mind is that we perform the singular value decomposition on the standardized data set, then the principal components are the columns of V, and the new coordinates of the data points are the entries of σ i U col i .
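
A short sketch of standardization before the SVD (synthetic data; the two feature scales are chosen only to make the point):

```python
import numpy as np

rng = np.random.default_rng(5)
# Two features on wildly different scales (say, meters and dollars).
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

# Standardize: subtract each feature's mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The SVD of Z is no longer dominated by the large-scale feature, and
# Z^t Z / n is the correlation matrix (ones on the diagonal).
corr = Z.T @ Z / len(Z)
assert np.allclose(np.diag(corr), 1.0)
```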

主成分分析和聚类

Principal Component Analysis and Clustering

我们在上一节中看到了如何使用主成分分析来减少数据的特征数量,按数据变化量的层次顺序提供一组新的特征。这对于可视化数据非常有用,因为我们只能在二维或三维中进行可视化。能够可视化高维数据(例如遗传数据)中的模式和相关性非常重要。有时,在由主成分确定的降维空间中,数据会按类别形成固有的聚类。例如,如果数据集同时包含癌症患者和非癌症患者,以及他们的基因表达(通常有数千个),我们可能会注意到,在前三个主成分张成的空间中绘制数据时,癌症患者与没有癌症的患者分开聚类。

We saw in the previous section how we can use principal component analysis to reduce the number of features of the data, providing a new set of features in hierarchical order in terms of variation in the data. This is incredibly useful for visualizing data, since we can only visualize in two or three dimensions. It is important to be able to visualize patterns and correlations in high-dimensional data, for example, in genetic data. Sometimes, in the reduced dimensional space determined by the principal components, there is an inherent clustering of the data by category. For example, if the data set contains both patients with cancer and patients without cancer, along with their genetic expression (usually in the thousands), we might notice that plotting the data in the first three principal components space, patients with cancer cluster separately from patients without cancer.

社交媒体应用程序

A Social Media Application

本着与主成分分析和聚类相同的精神,Dan Vilenchik 最近发表的一篇文章(2020 年 12 月)介绍了社交媒体领域的一个精彩应用:一种在在线社交媒体平台上刻画用户特征的无监督方法。以下是他就该主题所做演讲的摘要以及他的出版物的摘要:

In the same essence of principal component analysis and clustering, a recent publication (Dec 2020) by Dan Vilenchik presents a wonderful application from social media: an unsupervised approach to characterizing users in online social media platforms. Here’s the abstract from a talk he gave on the subject, along with the abstract from his publication:

理解从在线社交媒体或电子学习平台等在线平台自动收集的数据是一项具有挑战性的任务:数据是海量的、多维的、嘈杂的和异构的(由行为不同的个体组成)。在本次演讲中,我们重点关注所有在线社交平台共同的中心任务,即用户特征描述的任务。例如,自动识别 Twitter 上的垃圾邮件发送者或机器人,或者电子学习平台上的不积极的学生。

在线社交媒体渠道在我们的生活中发挥着核心作用。描述社交网络中的用户特征是一个长期存在的问题,可以追溯到 20 世纪 50 年代,当时 Katz 和 Lazarsfeld 研究“大众传播”中的影响力。在机器学习时代,这个任务通常被视为监督学习问题,其中需要预测目标变量:年龄、性别、政治倾向、收入等。在本次演讲中,我们探讨了在无监督学习中可以实现什么方式。具体来说,我们利用主成分分析来了解某些社交媒体平台固有的潜在模式和结构,而其他社交媒体平台则不然,以及原因。我们得出了类似辛普森的悖论,这可能会让我们更深入地了解此类平台中用户表征的数据驱动过程。

Making sense of data that is automatically collected from online platforms such as online social media or e-learning platforms is a challenging task: the data is massive, multidimensional, noisy, and heterogeneous (composed of differently behaving individuals). In this talk we focus on a central task common to all on-line social platforms and that is the task of user characterization. For example, automatically identify a spammer or a bot on Twitter, or a disengaged student in an e-learning platform.

Online social media channels play a central role in our lives. Characterizing users in social networks is a long-standing question, dating back to the 50’s when Katz and Lazarsfeld studied influence in “Mass Communication”. In the era of Machine Learning, this task is typically cast as a supervised learning problem, where a target variable is to be predicted: age, gender, political incline, income, etc. In this talk we explore what can be achieved in an unsupervised manner. Specifically, we harness principal component analysis to understand what underlying patterns and structures are inherent to some social media platforms, but not to others, and why. We arrive at a Simpson-like paradox that may give us a deeper understanding of the data-driven process of user characterization in such platforms.

利用主成分分析创建数据中方差最大的聚类这一思想,将在本书中多次出现。

The idea of principal component analysis for creating clusters with maximal variance in the data will appear multiple times throughout this book.

潜在语义分析

Latent Semantic Analysis

自然语言数据(文档)的潜在语义分析类似于数值数据的主成分分析。

Latent semantic analysis for natural language data (documents) is similar to principal component analysis for numerical data.

在这里,我们想要分析一组文档与其包含的单词之间的关系。潜在语义分析的分布假设指出,含义相似的单词出现在相似的文本片段中,因此也出现在相似的文档中。计算机只能理解数字,因此在对文本文档进行任何分析之前,我们必须先给出它们的数字表示。一种这样的表示形式是单词计数矩阵 X:列代表唯一的单词(例如苹果、橙子、狗、城市、智能等),行代表每个文档。这样的矩阵非常大但非常稀疏(有很多零)。单词太多了(它们就是特征),因此我们需要降低特征的维度,同时保留文档(数据点)之间的相似性结构。现在我们知道该怎么做了:对单词计数矩阵进行奇异值分解 X = UΣV^t,然后丢弃较小的奇异值以及 U 中的相应列和 V^t 中的相应行。我们现在可以在低维空间(单词的线性组合)中表示每个文档,就像主成分分析允许在低维特征空间中表示数据一样。

Here, we want to analyze the relationships between a set of documents and the words they contain. The distributional hypothesis for latent semantic analysis states that words that have similar meaning occur in similar pieces of text, and hence in similar documents. Computers only understand numbers, so we have to come up with a numerical representation of our word documents before doing any analysis on them. One such representation is the word count matrix X: the columns represent unique words (such as apple, orange, dog, city, intelligence, etc.) and the rows represent each document. Such a matrix is very large but very sparse (has many zeros). There are too many words (these are the features), so we need to reduce the dimension of the features while preserving the similarity structure among the documents (the data points). By now we know what to do: perform the singular value decomposition on the word count matrix, X = U Σ V t , then throw away the smaller singular values along with the corresponding columns from U and rows from V t . We can now represent each document in the lower-dimensional space (of linear combinations of words) in exactly the same way principal component analysis allows for data representation in lower-dimensional feature space.

一旦我们减少了维度,我们最终可以使用余弦相似度来比较文档:计算表示文档的两个向量之间的角度的余弦。如果余弦接近 1,则文档在单词空间中指向同一方向,因此表示非常相似的文档。如果余弦接近 0,则表示文档的向量彼此正交,因此彼此非常不同。

Once we have reduced the dimension, we can finally compare the documents using cosine similarity: compute the cosine of the angle between the two vectors representing the documents. If the cosine is close to 1, then the documents point in the same direction in word space and hence represent very similar documents. If the cosine is close to 0, then the vectors representing the documents are orthogonal to each other and hence are very different from each other.
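
The whole pipeline — word-count matrix, truncated SVD, then cosine similarity — can be sketched on a toy corpus (the tiny vocabulary and counts are invented for illustration):

```python
import numpy as np

# A tiny corpus: rows are documents, columns are word counts over the
# vocabulary [apple, orange, dog, city] (a stand-in for thousands of words).
X = np.array([
    [2, 1, 0, 0],   # doc 0: about fruit
    [1, 2, 0, 0],   # doc 1: about fruit
    [0, 0, 3, 1],   # doc 2: about dogs in the city
], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top 2 singular values: each document becomes a 2-D vector.
docs_2d = U[:, :2] * S[:2]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The two fruit documents point in nearly the same direction...
assert cosine(docs_2d[0], docs_2d[1]) > 0.9
# ...while the dog/city document is nearly orthogonal to them.
assert abs(cosine(docs_2d[0], docs_2d[2])) < 0.2
```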

在早期,谷歌搜索更像是一个索引,后来它演变为接受更多自然语言的搜索。智能手机的自动完成也是如此。潜在语义分析将句子或文档的含义压缩为一个向量,当将其集成到搜索引擎中时,可以显著提高引擎的质量,检索出我们正在寻找的确切文档。

In its early days, Google search was more like an index, then it evolved to accept more natural language searches. The same is true for smartphone autocomplete. Latent semantic analysis compresses the meaning of a sentence or a document into a vector, and when this is integrated into a search engine, it dramatically improves the quality of the engine, retrieving the exact documents we are searching for.

随机奇异值分解

Randomized Singular Value Decomposition

在本章中,我们有意避免计算奇异值分解,因为它的代价很高。然而,我们确实提到过,常见算法使用称为 QR 分解的矩阵分解(它获得数据矩阵列的正交基),然后使用 Householder 反射变换为双对角矩阵,最后使用迭代方法来计算所需的特征向量和特征值。遗憾的是,对于不断增长的数据集,所涉及的矩阵即使对这些高效算法而言也太大了。我们唯一的出路是随机线性代数。该领域依赖随机采样理论,为矩阵分解提供了极其高效的方法。随机数值方法创造了奇迹,既提供精确的矩阵分解,又比确定性方法便宜得多。随机奇异值分解对大数据矩阵 X 的列空间进行采样,计算采样后(小得多的)矩阵的 QR 分解,将 X 投影到较小的空间(Y = Q^t X,所以 X ≈ QY),然后计算 Y 的奇异值分解(Y = UΣV^t)。矩阵 Q 是正交的并且近似 X 的列空间,因此矩阵 Σ 和 V 对于 X 和 Y 是相同的。为了找到 X 的 U,我们可以根据 Y 的 U 和 Q 来计算它:U_X = Q U_Y。

In this chapter, we have avoided computing the singular value decomposition on purpose because it is expensive. We did mention, however, that common algorithms use a matrix decomposition called QR decomposition (which obtains an orthonormal basis for the columns of the data matrix), then Householder reflections to transform to a bidiagonal matrix, and finally iterative methods to compute the required eigenvectors and eigenvalues. Sadly, for the ever-growing data sets, the matrices involved are too large even for these efficient algorithms. Our only salvation is through randomized linear algebra. This field provides extremely efficient methods for matrix decomposition, relying on the theory of random sampling. Randomized numerical methods work wonders, providing accurate matrix decompositions while at the same time being much cheaper than deterministic methods. Randomized singular value decomposition samples the column space of the large data matrix X, computes the QR decomposition of the sampled (much smaller) matrix, projects X onto the smaller space ( Y = Q t X , so X Q Y ), then computes the singular value decomposition of Y ( Y = U Σ V t ). The matrix Q is orthonormal and approximates the column space of X, so the matrices Σ and V are the same for X and Y. To find the U for X, we can compute it from the U for Y and Q U X = Q U Y .
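
The steps just described can be sketched directly in NumPy (the sizes and the target rank are illustrative; a practical implementation would typically also add oversampling and power iterations):

```python
import numpy as np

rng = np.random.default_rng(4)
# A large, (approximately) rank-10 data matrix.
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 300))

k = 10   # target rank of the approximation

# 1. Sample the column space of X with a random projection.
Omega = rng.normal(size=(300, k))
sample = X @ Omega                      # shape (500, k)

# 2. Orthonormal basis Q for the sampled column space (QR decomposition).
Q, _ = np.linalg.qr(sample)

# 3. Project X onto the smaller space: Y = Q^t X, so X ≈ Q Y.
Y = Q.T @ X                             # shape (k, 300)

# 4. SVD of the small matrix Y; Sigma and Vt also serve for X.
U_Y, S, Vt = np.linalg.svd(Y, full_matrices=False)

# 5. Recover the left singular vectors of X: U_X = Q U_Y.
U_X = Q @ U_Y

# The rank-k reconstruction is accurate because X is (nearly) rank 10.
assert np.allclose(U_X @ np.diag(S) @ Vt, X)
```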

与所有随机方法一样,它们必须伴随着误差界限,即原始矩阵X与采样QY相差多远的期望。我们确实有这样的误差界限,但我们将它们推迟到第 7 章,该章讨论大型随机矩阵。

Like all randomized methods, they have to be accompanied by error bounds, in terms of the expectation of how far off the original matrix X is from the sampled QY. We do have such error bounds, but we postpone them until Chapter 7, which discusses large random matrices.

总结与展望

Summary and Looking Ahead

本章的亮点是一个公式:

The star of the show in this chapter was one formula:

X = UΣV^t = σ_1 U_col_1 V^t_row_1 + σ_2 U_col_2 V^t_row_2 + ⋯ + σ_r U_col_r V^t_row_r

这相当于 XV = UΣ,以及 X V_col_i = σ_i U_col_i。

This is equivalent to XV = UΣ and X V_col_i = σ_i U_col_i.

奇异值分解的强大之处在于它可以在不丢失基本信息的情况下进行降秩。这使我们能够压缩图像、降低数据集特征空间的维度,并在自然语言处理中计算文档相似度。

The power of the singular value decomposition is that it allows for rank reduction without losing essential information. This enables us to compress images, reduce the dimension of the feature space of a data set, and compute document similarity in natural language processing.

我们讨论了主成分分析、潜在语义分析以及主成分空间中固有的聚类结构。我们还讨论了一些示例:主成分分析被用作无监督聚类技术,根据基因表达对癌症患者进行聚类,以及刻画社交媒体用户的特征。

We discussed principal component analysis, latent semantic analysis, and the clustering structure inherent in the principal component space. We discussed examples where principal component analysis was used as an unsupervised clustering technique for cancer patients according to their gene expression, as well as characterizing social media users.

我们以随机奇异值分解结束了这一章,强调了本书反复出现的主题:当事物太大时,对它们进行采样。随机性非常可靠!

We ended the chapter with randomized singular value decomposition, highlighting a recurring theme for this book: when things are too large, sample them. Randomness is pretty much dependable!

如果您有兴趣深入研究,您可以阅读有关张量分解和 N 路数据数组的内容,以及数据对齐对于奇异值分解正常工作的重要性。我从《深度学习》一书中学到了这些东西。如果您对其他流行的例子感兴趣,您可以从现代的角度阅读有关特征脸的内容。

If you are interested in diving deeper, you can read about tensor decompositions and N-way data arrays, and the importance of data alignment for the singular value decomposition to work properly. I learned this stuff from the book Deep Learning. If you are interested in other popular examples, you can read about eigenfaces from a modern perspective.

第 7 章自然语言和金融人工智能:矢量化和时间序列

Chapter 7. Natural Language and Finance AI: Vectorization and Time Series

他们。能。读。

H。

They. Can. Read.

H.

人类智力的标志之一是我们在很小的时候就掌握了语言:理解书面和口头语言,以书面和口头形式表达思想,在两个或更多人之间进行对话,从一种语言翻译成另一种语言,以及使用语言来表达同理心、传达情感,并处理从周围环境感知到的视觉和音频数据。撇开意识的哲学问题不谈,如果机器获得了执行这些语言任务、破译词语意图的能力,并达到与人类相当或超过人类的水平,那么这将是迈向通用人工智能的主要推动力。这些任务属于自然语言处理、计算语言学、机器学习和/或概率语言建模的范畴。这些领域十分广阔,很容易看到感兴趣的人在各种前景远大的模型的迷雾中漫无目的地徘徊。我们不应该迷路。本章的目的是一次性对自然语言处理领域进行全面布局,以便我们能够鸟瞰整个领域,而不会陷入细节的泥潭。

One of the hallmarks of human intelligence is our mastery of language at a very early age: comprehension of written and spoken language, written and spoken expression of thoughts, conversation between two or more people, translation from one language to another, and the use of language to express empathy, convey emotions, and process visual and audio data perceived from our surroundings. Leaving the philosophical question of consciousness aside, if machines acquire the ability to perform these language tasks, deciphering the intent of words, at a level similar to humans, or above humans, then it is a major propeller toward general artificial intelligence. These tasks fall under the umbrellas of natural language processing, computational linguistics, machine learning, and/or probabilistic language modeling. These fields are vast and it is easy to find interested people wandering aimlessly in a haze of various models with big promises. We should not get lost. The aim of this chapter is to lay out the natural processing field all at once so we can have a bird’s-eye view without getting into the weeds.

以下问题始终指导我们:

The following questions guide us at all times:

  • 手头的任务是什么类型?换句话说,我们的目标是什么?

  • What type of task is at hand? In other words, what is our goal?

  • 手头有什么类型的数据?我们需要收集什么类型的数据?

  • What type of data is at hand? What type of data do we need to collect?

  • 有哪些最先进的模型可以处理类似的任务和类似的数据类型?如果没有,那么我们就必须自己想出模型。

  • What state-of-the-art models are out there that deal with similar tasks and similar types of data? If there are none, then we have to come up with the models ourselves.

  • 我们如何训练这些模型?他们以什么格式消费数据?他们以什么格式产生输出?它们有训练函数、损失函数(或目标函数)和优化结构吗?

  • How do we train these models? In what formats do they consume their data? In what formats do they produce their outputs? Do they have a training function, loss function (or objective function), and optimization structure?

  • 与其他模型相比,各种模型的优点和缺点是什么?

  • What are the advantages and disadvantages of various models versus others?

  • 是否有可用于模型实现的 Python 包或库?幸运的是,如今大多数模型都附带 Python 实现和非常简单的 API。更好的是,有许多预先训练的模型可供下载并准备在应用程序中使用。

  • Are there Python packages or libraries available for model implementation? Luckily, nowadays, most models come out accompanied with Python implementations and very simple APIs. Even better, there are many pre-trained models available to download and ready for use in applications.

  • 我们需要多少计算基础设施来训练和/或部署这些模型?

  • How much computational infrastructure do we need to train and/or deploy these models?

  • 我们可以做得更好吗?总是有改进的空间。

  • Can we do better? There is always room for improvement.

我们还需要从表现最好的模型中提取数学知识。值得庆幸的是,这是很容易的部分,因为许多模型都基于类似的数学,即使涉及不同类型的任务或不同的应用领域,例如预测句子中的下一个单词或预测股票市场行为。

We also need to extract the math from the best-performing models. Thankfully, this is the easy part, since similar mathematics underlies many models, even when relating to different types of tasks or from dissimilar application areas, such as predicting the next word in a sentence or predicting stock market behavior.

我们打算在本章中介绍的最先进的模型是:

The state-of-the-art models that we intend to cover in this chapter are:

  • Transformer 或注意力模型(自 2017 年起)。这里重要的数学非常简单:两个向量之间的点积。

  • Transformers or attention models (since 2017). The important math here is extremely simple: the dot product between two vectors.

  • 循环长短期记忆神经网络(自 1995 年起)。这里重要的数学是时间反向传播。我们在第 4 章中介绍了反向传播,但对于循环网络,我们采用相对于时间的导数。

  • Recurrent long short-term memory neural networks (since 1995). The important math here is backpropagation in time. We covered backpropagation in Chapter 4, but for recurrent nets, we take the derivatives with respect to time.

  • 用于时间序列数据的卷积神经网络(自 1989 年起)。重要的数学是卷积运算,我们在第 5 章中介绍过。

  • Convolutional neural networks (since 1989) for time series data. The important math is the convolution operation, which we covered in Chapter 5.

这些模型非常适合时间序列数据,即随时间顺序出现的数据。时间序列数据的示例包括电影、音频文件(例如音乐和录音)、金融市场数据、气候数据、动态系统数据、文档和书籍。

These models are very well suited for time series data, that is, data that appears sequentially with time. Examples of time series data are movies, audio files such as music and voice recordings, financial markets data, climate data, dynamic systems data, documents, and books.

我们可能想知道为什么文档和书籍可以被认为是与时间相关的,即使它们已经写好并且就在那里。为什么图像不依赖于时间,而一本书以及一般的阅读和写作却依赖于时间?答案很简单:

We might wonder why documents and books can be considered as time dependent, even though they have already been written and are just there. How come an image is not time dependent but a book and, in general, reading and writing are? The answer is simple:

  • 当我们读一本书时,我们一次理解一个单词,然后一次一个短语,然后一次一个句子,然后一次一个段落,等等。这就是我们掌握本书概念和主题的方式。

  • When we read a book, we comprehend what we read one word at a time, then one phrase at a time, then one sentence at a time, then one paragraph at a time, and so on. This is how we grasp the concepts and topics of the book.

  • 当我们编写文档时也是如此:我们一次输出一个单词,即使我们试图表达的整个想法在我们把单词按顺序写到纸上之前就已经存在并被编码了。

  • The same is true when we write a document, outputting one word at time, even though the whole idea we are trying to express is already there, encoded, before we output the words, sequentially, on paper.

  • 当我们为图像添加标题时,图像本身与时间无关,但我们的标题(输出)却是。

  • When we caption an image, the image itself is not time dependent, but our captioning (the output) is.

  • 当我们总结一篇文章、回答一个问题或从一种语言翻译成另一种语言时,输出文本是与时间相关的。如果使用循环神经网络处理,输入文本可能是时间相关的;如果使用 Transformer 或卷积模型一次性处理,则输入文本可能是静态的。

  • When we summarize an article, answer a question, or translate from one language to another, the output text is time dependent. The input text could be time dependent if processed using a recurrent neural network, or stationary if processed all at once using a transformer or a convolutional model.

直到 2017 年,处理时间序列数据的最流行的机器学习模型要么基于卷积神经网络,要么基于具有长短期记忆的循环神经网络。2017 年,变压器占据了主导地位,在某些应用领域完全放弃了循环。循环神经网络是否已经过时的问题是存在的,但随着人工智能领域的情况每天都在变化,谁知道哪些模型会消亡,哪些模型会经受住时间的考验。此外,循环神经网络为许多人工智能引擎提供动力,并且仍然是积极研究的主题。

Until 2017, the most popular machine learning models to process time series data were based either on convolutional neural networks or on recurrent neural networks with long short-term memory. In 2017, transformers took over, abandoning recurrence altogether in certain application areas. The question of whether recurrent neural networks are obsolete is out there, but with things changing every day in the AI field, who knows which models will die and which will survive the test of time. Moreover, recurrent neural networks power many AI engines and are still subjects of active research.

在本章中,我们回答以下问题:

In this chapter, we answer the following questions:

  • 我们如何将自然语言文本转换为保留意义的数字量?我们的机器只能理解数字,我们需要使用这些机器来处理自然语言。我们必须对文本数据样本进行向量化,或者将它们嵌入到有限维向量空间中。

  • How do we transform natural language text to numerical quantities that retain meaning? Our machines only understand numbers, and we need to process natural language using these machines. We must vectorize our samples of text data, or embed them into finite dimensional vector spaces.

  • 我们如何降低向量的维度?最初表示自然语言所需的向量维度是巨大的。例如,法语有大约 135,000 个不同的单词,那么我们如何避免用每个都包含 135,000 个条目的向量对法语句子中的单词进行单热编码呢?

  • How do we lower the dimension of the vectors from the enormous ones initially required to represent natural language? For example, the French language has around 135,000 distinct words, so how do we get around having to one-hot code words in a French sentence using vectors of 135,000 entries each?

  • 手头的模型是否将我们的自然语言数据(作为其输入和/或输出)视为一次输入一项的时间相关序列,或者一次消耗全部静态向量?

  • Does the model at hand consider (as its input and/or output) our natural language data as a time dependent sequence fed into it one term at a time, or a stationary vector consumed all at once?

  • 自然语言处理的各种模型到底是如何工作的?

  • How exactly do various models for natural language processing work?

  • 为什么这一章也有金融呢?

  • Why is there finance in this chapter as well?

在此过程中,我们讨论了我们的模型非常适合的自然语言和金融应用程序的类型。我们将重点放在数学而不是编程上,因为此类模型(尤其是语言应用程序)需要大量的计算基础设施。例如,DeepL Translator使用冰岛水力发电的超级计算机生成翻译,其速度达到 5.1 petaflops。我们还注意到,在 NVIDIA、谷歌的张量处理单元、AWS Inferentia、AMD 的 Instinct GPU 以及 Cerebras 和 Graphcore 等初创公司的引领下,人工智能专用芯片行业正在蓬勃发展。虽然传统芯片一直难以跟上摩尔定律(该定律预测处理能力每 18 个月就会翻一番),但人工智能专用芯片却大幅超越了该定律。

Along the way, we discuss the types of natural language and finance applications that our models are well suited for. We keep the focus on the mathematics and not the programming, since such models (especially for language applications) require substantive computational infrastructures. For example, the DeepL Translator generates its translations using a supercomputer operated with hydropower from Iceland, which reaches 5.1 petaflops. We also note that the AI-specialized chip industry is booming, led by NVIDIA, Google’s Tensor Processing Unit, AWS Inferentia, AMD’s Instinct GPU, and startups like Cerebras and Graphcore. While conventional chips have struggled to keep pace with Moore’s law, which predicted a doubling of processing power every 18 months, AI-specialized chips have outpaced this law by a wide margin.

尽管我们不为本章编写代码,但我们注意到大多数编程都可以使用 Python 的 TensorFlow 和 Keras 库来完成。

Even though we do not write code for this chapter, we note that most programming can be accomplished using Python’s TensorFlow and Keras libraries.

在整个讨论过程中,我们必须注意我们是处于模型的训练阶段还是预测阶段(使用预训练的模型来执行任务)。此外,重要的是要区分我们的模型是否需要标记数据进行训练,例如英语句子及其法语翻译作为标签,或者可以从未标记数据中学习,例如根据单词计算单词的含义上下文。

Throughout our discussion, we have to be mindful of whether we are in the training phase of a model or in the prediction phase (using the pre-trained model to do tasks). Moreover, it is important to differentiate whether our model needs labeled data to be trained, such as English sentences along with their French translations as labels, or can learn from unlabeled data, such as computing the meanings of words from their contexts.

自然语言人工智能

Natural Language AI

自然语言处理应用程序无处不在。这项技术已经融入我们生活的方方面面,以至于我们认为它是理所当然的:在智能手机、数字日历、数字家庭助理、Siri、Alexa 等设备上使用应用程序时都会用到它。以下列表部分改编自 Hobson Lane、Hannes Hapke 和 Cole Howard 所著的优秀著作《自然语言处理实践》(Manning 2019),展示了自然语言处理已变得多么不可或缺:

Natural language processing applications are ubiquitous. This technology has been integrated into so many aspects of our lives that we just take it for granted: when using apps on our smartphones, digital calendars, digital home assistants, Siri, Alexa, and others. The following list is partially adapted from the excellent book Natural Language Processing in Action by Hobson Lane, Hannes Hapke, and Cole Howard (Manning 2019), demonstrating how indispensable natural language processing has become:

  • 搜索和信息检索:网络、文档、自动完成、聊天机器人

  • Search and information retrieval: web, documents, autocomplete, chatbots

  • 电子邮件:垃圾邮件过滤器、电子邮件分类、电子邮件优先级

  • Email: spam filter, email classification, email prioritization

  • 编辑:拼写检查、语法检查、风格推荐

  • Editing: spelling check, grammar check, style recommendation

  • 情感分析:产品评论、客户关怀、社区士气监控

  • Sentiment analysis: product reviews, customer care, monitoring of community morale

  • 对话:聊天机器人、亚马逊 Alexa 等数字助理、日程安排

  • Dialog: chatbots, digital assistants such as Amazon’s Alexa, scheduling

  • 写作:索引、语词索引、目录

  • Writing: indexing, concordance, table of contents

  • 文本挖掘:摘要、知识提取,例如挖掘竞选活动的财务和自然语言数据(寻找政治捐助者之间的联系)、简历与工作匹配、医疗诊断

  • Text mining: summarization, knowledge extraction such as mining election campaigns’ finance and natural language data (finding connections between political donors), résumé-to-job matching, medical diagnosis

  • 法律:法律推理、先例检索、传票分类

  • Law: legal inference, precedent search, subpoena classification

  • 新闻:事件检测、事实检查、标题构成

  • News: event detection, fact checking, headline composition

  • 归因: 剽窃检测、文学取证、风格指导

  • Attribution: plagiarism detection, literary forensics, style coaching

  • 行为预测:金融应用、选举预测、营销

  • Behavior prediction: finance applications, election forecasting, marketing

  • 创意写作:电影剧本、诗歌、歌词、机器人驱动的金融和体育新闻报道

  • Creative writing: movie scripts, poetry, song lyrics, bot-powered financial and sports news stories

  • 字幕:计算机视觉与自然语言处理相结合

  • Captioning: computer vision combined with natural language processing

  • 翻译:谷歌翻译和 DeepL 翻译

  • Translation: Google Translate and DeepL Translate

尽管过去十年取得了令人瞩目的成就,但机器距离掌握自然语言还差得很远。所涉及的过程很乏味,需要细心的统计簿记和大量的记忆,就像人类需要记忆来掌握语言一样。这里的要点是:该领域有足够的创新和贡献空间。

Even though the past decade has brought impressive feats, machines are still nowhere close to mastering natural language. The processes involved are tedious, requiring attentive statistical bookkeeping and substantive memory, the same way humans require memory to master languages. The point here is: there is plenty of room for new innovations and contributions to the field.

语言模型最近已从手工编码转向数据驱动。它们不实现硬编码的逻辑和语法规则。相反,他们依赖于检测单词之间的统计关系。尽管语言学中有一个学派认为语法是人类与生俱来的属性,或者换句话说,语法被硬编码到我们的大脑中,但人类具有惊人的能力来掌握新语言,而无需遇到这些语言的任何语法规则。从个人经验来看,尝试学习一门新语言的语法似乎会阻碍学习过程,但请不要引用我的话。

Language models have recently shifted from handcoded to data driven. They do not implement hardcoded logical and grammar rules. Instead, they rely on detecting the statistical relationships between words. Even though there is a school of thought in linguistics that asserts grammar is an innate property for humans, or in other words, is hardcoded into our brains, humans have a striking ability to master new languages without ever encountering any grammatical rules for these languages. From personal experience, attempting to learn the grammar of a new language seems to impede the learning process, but do not quote me on that.

一项主要挑战是自然语言数据的维数极高。数千种语言中有数百万个单词。还存在巨大的文档语料库,例如作者作品的全集、数十亿条推文、维基百科文章、新闻文章、Facebook 评论、电影评论等。因此,第一个目标是降低维数,以便高效地存储、处理和计算,同时避免丢失重要信息。这是人工智能领域的一个常见主题:人们不禁想知道,如果我们拥有无限的存储和计算基础设施,有多少数学创新将永远不会问世。

One major challenge is that data for natural language is extremely high-dimensional. There are millions of words across thousands of languages. There are huge corpuses of documents, such as entire collections of authors’ works, billions of tweets, Wikipedia articles, news articles, Facebook comments, movie reviews, etc. A first goal is then to reduce the number of dimensions for efficient storage, processing, and computation, while at the same time avoiding the loss of essential information. This has been a common theme in the AI field, and one cannot help but wonder how many mathematical innovations would have never seen the light of the day had we possessed unlimited storage and computational infrastructures.

为机器处理准备自然语言数据

Preparing Natural Language Data for Machine Processing

机器要处理任何自然语言任务,它必须做的第一件事就是分解文本,并将其组织成保留含义、意图、上下文、主题、信息和情感的构建块。为此,它必须在单词和数字标记之间建立对应关系,所用的过程称为标记化、词干提取(例如为单数词及其复数变体赋予相同的标记)、词形还原(将几个含义相似的词关联在一起)、大小写标准化(例如为拼写相同的大写和小写单词赋予相同的标记)等等。这种对应关系不是针对组成单词的单个字符,而是针对携带含义的完整单词、由两个或更多单词组成的词组(2-gram 或 n-gram)、标点符号、有意义的大写等。这就为给定的自然语言文档语料库创建了一个由数字标记构成的词汇表或词典。这个意义上的词汇表或词典类似于 Python 字典:每个单独的自然语言构建块对象都有一个唯一的标记。

For a machine to process any natural language task, the first thing it must do is to break down text and organize it into building blocks that retain meaning, intent, context, topics, information, and sentiments. To this end, it must establish a correspondence between words and number tags, using processes called tokenizing, stemming (such as giving singular words and their plural variation the same token), lemmatization (associating several words of similar meaning together), case normalization (such as giving capitalized and lowercase words of the same spelling the same tokens), and others. This correspondence is not for individual characters that make up words, but for full words, pairs or more of words (2-grams or n-grams), punctuations, significant capitalizations, etc., that carry meaning. This creates a vocabulary or a lexicon of numerical tokens corresponding to a given corpus of natural language documents. A vocabulary or a lexicon in this sense is similar to a Python dictionary: each individual natural language building block object has a unique token.

n 元语法(n-gram)是由 n 个单词组成的序列,当它们按顺序排列在一起时,其含义不同于每个单词各自的含义。例如,2-gram 是一对在一起的单词,如果我们将它们拆开,含义就会改变,例如 ice cream 或 was not,因此整个 2-gram 得到一个数字标记,在正确的上下文中保留这两个单词的含义。类似地,3-gram 是有序单词的三元组,例如 John F. Kennedy,等等。自然语言的解析器就相当于计算机的编译器。如果这些新术语让您感到困惑,请不要担心。出于数学目的,我们需要的只是与独特的单词、n 元语法、表情符号、标点符号等相关联的数字标记,以及由此得到的自然语言文档语料库的词汇表。它们保存在类似字典的对象中,使我们能够轻松地在文本和数字标记之间来回切换。

An n-gram is a sequence of n words that carry a meaning when kept ordered together that is different from the meaning of each word on its own. For example, a 2-gram is a couple of words together whose meaning would change if we unpair them, such as ice cream or was not, so the whole 2-gram gets one numerical token, retaining the meaning of the two words within their correct context. Similarly, a 3-gram is a triplet of ordered words, such as John F. Kennedy, and so on. A parser for natural language is the same as a compiler for computers. Do not worry if these new terms confuse you. For our mathematical purposes, all we need are the numerical tokens associated with unique words, n-grams, emojis, punctuation, etc., and the resulting vocabulary for a corpus of natural language documents. These are saved in a dictionary like objects, allowing us to flip back and forth easily between text and numerical tokens.
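
A bare-bones sketch of tokenizing and n-gram extraction using only the Python standard library (real pipelines use proper tokenizers; the regex here is a deliberate simplification):

```python
import re
from collections import Counter

def tokenize(text):
    # Case normalization plus a crude split into word tokens.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    # Slide a window of length n over the ordered token sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "I scream, you scream, we all scream for ice cream"
tokens = tokenize(text)

# The vocabulary: one unique numerical token per building block.
vocab = {word: token_id for token_id, word in enumerate(sorted(set(tokens)))}

# 2-grams keep paired words like "ice cream" together as one unit.
bigrams = ngrams(tokens, 2)
assert "ice cream" in bigrams
assert Counter(tokens)["scream"] == 3
```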

We leave the actual details of tokenizing, stemming, lemmatization, parsing, and other natural language data preparations to computer scientists and their collaborations with linguists. In fact, collaboration with linguists has become less important as models mature in their ability to detect patterns directly from the data; thus, the need for coding handcrafted linguistic rules into natural language models has diminished. Note also that not all natural language pipelines include stemming and lemmatization. They all, however, involve tokenizing. The quality of tokenized text data is crucial for the performance of our natural language pipeline: it is the first step, producing the fundamental building blocks representing the data that we feed into our models. The quality of both the data and the way it is tokenized affects the outputs of the entire natural language processing pipeline. For your production applications, use the spaCy parser, which does sentence segmentation, tokenization, and multiple other things in one pass.
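Before reaching for a full parser, it helps to see how little machinery the token-to-integer correspondence needs. The sketch below is a hypothetical, minimal illustration (the `tokenize` function and its regex are my own simplifications, not spaCy's API); it performs case normalization and builds a vocabulary mapping each unique word to a numerical token:

```python
import re

def tokenize(text, vocab=None):
    """Lowercase the text (case normalization) and split it into word tokens,
    assigning each new token the next integer ID in the vocabulary."""
    if vocab is None:
        vocab = {}
    # Keep runs of letters and apostrophes; drop other punctuation for this sketch.
    words = re.findall(r"[a-z']+", text.lower())
    ids = []
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)  # next unused numerical token
        ids.append(vocab[w])
    return ids, vocab

ids, vocab = tokenize("The weather today, and the weather tomorrow.")
print(vocab)  # each unique word gets one integer token
print(ids)    # 'The'/'the' share a token thanks to case normalization
```

A real pipeline would also handle n-grams, emojis, and significant capitalization before lowercasing, but the resulting object is the same: a dictionary between text building blocks and numbers.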

After tokenizing and building a healthy vocabulary (the collection of numerical tokens and the entities they correspond to in the natural language text), we need to represent entire natural language documents using vectors of numbers. These documents can range from very long, such as a book series, to very short, such as a Twitter tweet or a simple search query for Google Search or DuckDuckGo. We can then express a corpus of one million documents as a collection of one million numerical vectors, or a matrix with one million columns. These columns will be as long as our chosen vocabulary, or shorter if we decide to compress these documents further. In linear algebra language, the length of these vectors is the dimension of the vector space in which our documents are embedded.

The whole point of this process is to obtain numerical vector representations of our documents so that we can do math on them: now comes linear algebra with its arsenal of linear combinations, projections, dot products, and singular value decompositions. There is, however, one caveat: for natural language applications, the lengths of the vectors representing our documents, or the size of our vocabulary, are prohibitively enormous to do any useful computations with. The curse of dimensionality becomes a real thing.

The Curse of Dimensionality

Vectors become exponentially farther apart in terms of Euclidean distance as the number of dimensions increases. One natural language example is sorting documents based on their distance from another document, such as a search query. This simple operation becomes impractical when we go above 20 dimensions or so if we use the Euclidean distance to measure the closeness of documents (see Wikipedia’s “curse of dimensionality” for more details). Thus, for natural language applications, we must use another measure for distance between documents. We will discuss cosine similarity shortly, which measures the angle between two document vectors, as opposed to their Euclidean distance.
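A quick numerical experiment makes the problem concrete. In the sketch below (the dimensions and the uniform random vectors are arbitrary illustration choices), the Euclidean distance between random vectors keeps growing with the dimension, while the cosine similarity stays in a fixed, usable range:

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

random.seed(0)
for dim in (2, 20, 200, 2000):
    u = [random.random() for _ in range(dim)]
    v = [random.random() for _ in range(dim)]
    # Euclidean distance grows with the dimension; the cosine does not.
    print(dim, round(euclidean(u, v), 2), round(cosine_similarity(u, v), 2))
```

This is why the cosine similarity, not the Euclidean distance, is the workhorse for comparing high-dimensional document vectors.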

Therefore, a main driver for natural language processing models is to represent these documents using shorter vectors that convey the main topics and retain meaning. Think how many unique tokens or combinations of tokens we have to use to represent this book while at the same time preserving its most important information.

To summarize, our natural language processing pipeline proceeds as follows:

  1. From text to numerical tokens, then to an acceptable vocabulary for an entire corpus of documents.

  2. From documents of tokens to high-dimensional vectors of numbers.

  3. From high-dimensional vectors of numbers to lower-dimensional vectors of topics using techniques like direct projection onto a smaller subset of the vocabulary space (just dropping part of the vocabulary, making the corresponding entries zero), latent semantic analysis (projecting onto special vectors determined by special linear combinations of the document vectors), word2vec, Doc2Vec, thought vectors, latent Dirichlet allocation, and others. We discuss these shortly.

As is usually the case in mathematical modeling, there is more than one way to represent a given document as a vector of numbers. We decide on the vector space that our documents inhabit, or get embedded in. Each vector representation has advantages and disadvantages, depending on the goal of our natural language task. Some are simpler than others too.

Statistical Models and the log Function

When the representation of a document as a vector of numbers starts with counting the number of times certain terms appear in the document, our document-vectorizing model is statistical, since it is frequency based.

When we deal with term frequencies, it is better to apply the log function to our counts as opposed to using raw counts. The log function is advantageous when we deal with quantities that could get extremely large, extremely small, or could have extreme variations in scale. Viewing these extreme counts or variations within a logarithmic scale brings them back to the normal realm.

For example, the number 10^23 is huge, but log(10^23) = 23 log(10) is not. Similarly, if the term “shark” appears in 2 documents of a corpus of 20 million documents (20 million/2 = 10 million), and the term “whale” appears in 20 documents of this corpus (20 million/20 = 1 million), then that is a difference of 9 million, which seems excessive for terms that appeared in only 2 and 20 documents, respectively. Computing the same quantities on a log scale, we get 7 log(10) and 6 log(10), respectively (no matter which log base we use), which no longer seems excessive and is more in line with the terms’ occurrence in the corpus.
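The shark/whale arithmetic can be checked directly; a small sketch, with the counts taken straight from the example above:

```python
import math

corpus_size = 20_000_000
docs_with_shark = 2
docs_with_whale = 20

idf_shark = corpus_size / docs_with_shark  # 10,000,000
idf_whale = corpus_size / docs_with_whale  # 1,000,000
print(idf_shark - idf_whale)  # a 9 million gap on the raw scale

# On a log scale the same quantities differ by a single log(10):
print(math.log10(idf_shark))  # 7.0
print(math.log10(idf_whale))  # 6.0
```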

The need for using the log function when dealing with word counts in particular is reinforced by Zipf’s law. This law says that term counts in a corpus of natural language naturally follow a power law, so it is best to temper that with a log function, transforming differences in term frequencies into a linear scale. We discuss this next.

Zipf’s Law for Term Counts

Zipf’s law for natural language has to do with word counts. It is very interesting and so surprising that I am tempted to try and see if it applies to my own book. It is hard to imagine that as I write each word in this book, my unique word counts are actually following some law. Are we, along with the way we word our ideas and thoughts, that predictable? It turns out that Zipf’s law extends to counting many things around us, not only words in documents and corpuses.

Zipf’s law reads as follows: for a corpus of natural language where the terms have been ordered according to their frequencies, the frequency of the first item is twice that of the second item, three times that of the third item, and so on. That is, the frequency with which an item appears in a corpus is related to its ranking: f_1 = 2 f_2 = 3 f_3 = …

We can verify if Zipf’s law applies by plotting the frequency of the terms against their respective ranks and verifying the power law: f_r = f(r) = f_1 r^(-1). To verify power laws, it is easier to make a log-log plot, plotting log(f_r) against log(r). If we obtain a straight line in the log-log plot, then f_r = f(r) = f_1 r^α, where α is the slope of the straight line.
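Here is a small sketch of that verification. The four-word "corpus" is engineered so that f_r = f_1 / r exactly, so the fitted log-log slope should come out as α ≈ -1 (the `zipf_slope` function and the toy data are my own illustrations):

```python
import math
from collections import Counter

def zipf_slope(text):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Frequencies 12, 6, 4, 3 for ranks 1..4, i.e., f_r = 12 / r exactly.
toy = " ".join(["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3)
print(round(zipf_slope(toy), 2))  # -1.0, the Zipf exponent
```

On a real corpus the slope will only be approximately -1, which is exactly what the log-log plot is meant to reveal.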

Various Vector Representations for Natural Language Documents

Let’s list the most common document vector representations for state-of-the-art natural language processing models. The first two, term frequency and term frequency-inverse document frequency (TF-IDF), are statistical representations since they are frequency based, relying on counting word appearances in documents. They are slightly more involved than a simple binary representation detecting the presence or absence of certain words; nevertheless, they are still shallow, merely counting words. Even with this shallowness, they are very useful for applications such as spam filtering and sentiment analysis.

Term Frequency Vector Representation of a Document or Bag of Words

Here, we represent a document using a bag of words, discarding the order in which words appear in the document. Even though word order encodes important information about a document’s content, ignoring it is usually an OK approximation for short sentences and phrases.

Suppose we want to embed our given document in a vocabulary space of 10,000 tokens. Then the vector representing this document will have 10,000 entries, with each entry counting how many times each particular token appears in the document. For obvious reasons, this is called the term frequency or bag-of-words vector representation of the document, where each entry is a nonnegative integer (a whole number).

For example, the Google Search query “What’s the weather tomorrow?” will be vectorized as zeros everywhere except for ones at the tokens representing the words “what,” “the,” “weather,” and “tomorrow,” if they exist in the vocabulary. We then normalize this vector, dividing each entry by the total number of terms in the document so that the length of the document doesn’t skew our analysis. That is, if a document has 50,000 terms and the term “cat” gets mentioned a hundred times, and another document has a hundred terms only and the term “cat” gets mentioned 10 times, then obviously the word “cat” is more important for the second document than for the first, and a mere word count without normalizing would not be able to capture that.

Finally, some natural language processing classes take the log of each term in the document vector for the reasons mentioned in the previous two sections.
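The normalized bag-of-words construction described above takes only a few lines (the vocabulary and the query below are toy examples of my own):

```python
from collections import Counter

def term_frequency_vector(doc_tokens, vocab):
    """Bag of words, normalized by document length so that long and
    short documents are comparable."""
    counts = Counter(doc_tokens)
    return [counts[term] / len(doc_tokens) for term in vocab]

vocab = ["what", "the", "weather", "tomorrow", "cat"]
query = ["what", "the", "weather", "tomorrow"]
print(term_frequency_vector(query, vocab))  # [0.25, 0.25, 0.25, 0.25, 0.0]
```

A real vocabulary would have thousands of entries, with the query's vector zero almost everywhere.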

Term Frequency-Inverse Document Frequency Vector Representation of a Document

Here, for each entry of the vector representing the document, we still count the number of times the token appears in the document, but then we divide by the number of documents in our corpus in which the token occurs.

The idea is that if a term appears many times in one document and not as much in the others, then this term must be important for this one document, getting a higher score in the corresponding entry of the vector representing this document.

To avoid division by zero, if a term does not appear in any document, it is common practice to add one to the denominator. For example, the inverse document frequency of the token cat is:

idf(“cat”) = (number of documents in the corpus) / (number of documents containing “cat” + 1)

Obviously, using TF-IDF representation, the entries of the document vectors will be nonnegative rational numbers, each providing a measure of the importance of that particular token to the document. Finally, we take the log of each entry in this vector, for the same reasons stated in the previous section.
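A sketch of the whole construction, using the +1 denominator from above. I apply a log(1 + x) transform rather than a plain log, an assumption on my part: a common smoothing variant that keeps zero entries at zero instead of sending them to negative infinity:

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, vocab):
    """TF-IDF for one document: normalized term frequency times inverse
    document frequency, passed through log(1 + x)."""
    counts = Counter(doc_tokens)
    vec = []
    for term in vocab:
        tf = counts[term] / len(doc_tokens)
        n_containing = sum(term in doc for doc in corpus)
        idf = len(corpus) / (n_containing + 1)  # +1 avoids division by zero
        vec.append(math.log(1 + tf * idf))
    return vec

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["a", "cat", "and", "a", "cat"]]
vocab = ["cat", "dog", "the"]
print([round(x, 3) for x in tf_idf_vector(corpus[2], corpus, vocab)])
```

The third document scores highest on "cat" because the term is frequent in it relative to the rest of the corpus.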

There are many alternative TF-IDF approaches relevant to information retrieval systems, such as Okapi BM25 and Molino 2017.

Topic Vector Representation of a Document Determined by Latent Semantic Analysis

TF-IDF vectors are very high-dimensional (as many dimensions as tokens in the corpus, so it could be in the millions), sparse, and have no special meaning when added or subtracted from each other. We need more compact vectors, in the hundreds of dimensions or less, which is a big squeeze from millions of dimensions. In addition to the dimension reduction advantage, these vectors capture some meaning, not only word counts and statistics. We call them topic vectors. Instead of focusing on the statistics of words in documents, we focus on the statistics of connections between words in documents and across corpuses. The topics produced here will be linear combinations of word counts.

First, we process the whole TF-IDF matrix X of our corpus, producing our topic space. Processing in this case means that we compute the singular value decomposition of the TF-IDF matrix from linear algebra, namely, X = U Σ V^t. Chapter 6 is dedicated to singular value decomposition, so for now we will only explain how it is used for producing our topic space for a corpus. Singular value decomposition from linear algebra is called latent semantic analysis in natural language processing. We will use both terms synonymously.

We have to pay attention to whether the columns of the corpus’s TF-IDF matrix X represent the word tokens or the documents. Different authors and software packages use one or the other, so we must be careful and process either the matrix or its transpose to produce our topic space. In this section we follow the representation that the rows are all the words (tokens for words, n-grams, etc.) of the entire corpus, and the columns are the TF-IDF vector representations for each document in the corpus. This is slightly divergent from the usual representation of a data matrix, where the features (the words within each document) are in the columns, and the instances (the documents) are in the rows. The reason for this switch will be apparent shortly. However, this is not divergent from our representation for documents as column vectors.

Next, given a new document with its TF-IDF vector representation, we convert it to a much more compact topic vector by projecting it onto the topic space produced by the singular value decomposition of the corpus’s TF-IDF matrix. Projecting in linear algebra is merely computing the dot product between the appropriate vectors and saving the resulting scalar numbers into the entries of a new projected vector:

  • We have a TF-IDF vector of a document that has as many entries as the number of tokens in the entire corpus.

  • We have topic weight vectors, which are the columns of the matrix U produced by the singular value decomposition of the TF-IDF matrix X = U Σ V^t. Again, each topic weight vector has as many entries as tokens in our corpus. Initially, we also have as many topic weight vectors as tokens in our entire corpus (columns of U). The weights in a column of U tell us how much a certain token contributes to that topic: a big contribution if the weight is a positive number close to 1, an ambivalent contribution if it is close to 0, and even a negative contribution if it is a negative number close to -1. Note that the entries of U are always numbers between -1 and 1, so we interpret them as weighing factors for our corpus’s tokens.
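The recipe above can be sketched with NumPy's SVD on a toy TF-IDF matrix. All the numbers below are made up for illustration; rows are tokens and columns are documents, matching the convention described earlier:

```python
import numpy as np

# Toy TF-IDF matrix X: 4 vocabulary tokens (rows) by 3 documents (columns).
X = np.array([
    [0.9, 0.8, 0.0],
    [0.7, 0.9, 0.1],
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.9],
])

# Latent semantic analysis = singular value decomposition of X.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top k topic weight vectors (the first k columns of U).
k = 2
topic_space = U[:, :k]

# Project a new document's TF-IDF vector onto the topic space:
# k dot products produce a k-dimensional topic vector.
new_doc = np.array([0.8, 0.6, 0.0, 0.1])
topic_vector = topic_space.T @ new_doc
print(topic_vector.shape)  # (2,)
```

The 4-dimensional document is compressed to 2 topic coordinates; on a real corpus the same two lines squeeze millions of dimensions down to a few hundred.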

Topic selection and dimension reduction

You might be wondering, if we have as many topic weight vectors as tokens in our corpus, each having as many entries as tokens as well, then where are the savings, and when will compression or dimension reduction happen? Keep reading.

Goal 1

Compute how much of a certain topic our document contains. This is simply the dot product between the document’s TF-IDF vector and the column of U corresponding to the topic that we care for. Record this as the first scalar number.

Goal 2

Compute how much of another topic our document contains. This is the dot product between the document’s TF-IDF vector and the column of U corresponding to this other topic that we care for. Record this as the second scalar number.

Goal 3

Repeat this for as many (as there are columns of U, which is the same as the total number of tokens in the corpus) or as few (as one) topics as we like, recording the scalar number from each dot product that we compute. It is clear now that “a topic” in this context means a column vector containing weights between -1 and 1 assigned to each token in the corpus.

Goal 4

Reduce the dimension by keeping only the topics that matter. That is, if we decide to keep only two topics, then the compressed vector representation of our document will be the two-dimensional vector containing the two scalar numbers produced using the two dot products between the document’s TF-IDF vector and the two topics’ weight vectors. This way, we would have reduced the dimension of our document from possibly millions to just two. Pretty cool stuff.

Goal 5

Choose the right topics to represent our documents. This is where the singular value decomposition works its magic. The columns of U are organized in order, from the most important topic across the corpus to the least important. In the language of statistics, the columns are organized from the topic with the most variance across the corpus, which hence encodes the most information, to the one with the least variance, which hence encodes little information. We explain how variance and singular value decomposition are related in Chapter 10. Thus, if we decide to project our high-dimensional document onto only the first few column vectors of U, we are guaranteed not to miss much in terms of capturing enough variation of possible topics across the corpus, and assessing how much of these our document contains.

Goal 6

Understand that this is still a statistical method for capturing topics in a document. We started with the TF-IDF matrix of a corpus, simply counting token occurrences in documents. In this sense, a topic is captured based only on the premise that documents that refer to similar things use similar words. This is different from capturing topics based on the meanings of the words they use. That is, if we have two documents discussing the same topic but using entirely different vocabulary, they will be far apart in topic space. The remedy to this would be to store words together with other words of similar meaning, which is the word2vec approach, discussed later in this chapter.

Question 1

What happens if we add another document to our corpus? Luckily, we do not have to reprocess the whole corpus to produce the document’s topic vector, we just project it onto the corpus’s existing topic space. This of course breaks down if we add a new document that has nothing in common with our corpus, such as an article on pure mathematics added to a corpus on Shakespeare’s love sonnets. Our math article in this case will be represented by a bunch of zeros or close-to-zero entries, which does not capture the ideas in the article adequately.

Now we have the following questions:

Question 2

What about the matrix V^t in the singular value decomposition X = U Σ V^t; what does it mean in the context of natural language processing of our corpus? The matrix V^t has the same number of rows and columns as the number of documents in our corpus. It is the document-document matrix and gives the shared meaning between documents.

Question 3

When we move to a lower-dimensional topic space using latent semantic analysis, are large distances between documents preserved? Yes, since the singular value decomposition focuses on maximizing the variance across the corpus’s documents.

Question 4

Are small distances preserved, meaning does latent semantic analysis preserve the fine structure of a document that separates it from not so different other documents? No. Latent Dirichlet allocation, discussed soon, does a better job here.

Question 5

Can we improve latent semantic analysis to also keep close document vectors together in the lower-dimensional topic space? Yes, we can steer the vectors by taking advantage of extra information, or metadata, of the documents, such as messages having the same sender, or by penalizing with a cost function so that the method produces topic vectors that preserve closeness as well.

To summarize, latent semantic analysis chooses the topics in an optimal way that maximizes the diversity in the topics across the corpus. The matrix U from the singular value decomposition of the TF-IDF matrix is very important for us. It returns the directions along which the variance is maximal. We usually get rid of the topics that have the least amount of variance between the documents in the corpus, throwing away the last columns of U. This is similar to manually getting rid of stop words (and, a, the, etc.) during text preparation, but latent semantic analysis does that for us in an optimized way. The matrix U has the same number of rows and columns as our vocabulary. It is the cross-correlation between words and topics based on word co-occurrence in the same document. When we multiply a new document by U (project it onto the columns of U), we would get the amount of each topic in the document. We can truncate U as we wish and throw away less important topics, reducing the dimension to as few topics as we want.

Shortcomings of latent semantic analysis

The topic spaces it produces, or the columns of U, are mere linear combinations of tokens that are thrown together in a way that captures as much variance in the usage across the vocabulary’s tokens as possible. This doesn’t necessarily translate into word combinations that are in any way meaningful to humans. Bummer. Word2vec, discussed later, addresses these shortcomings.

Finally, the topic vectors produced via latent semantic analysis are just linear transformations performed on the TF-IDF vectors. They should be the first choice for semantic searches, clustering documents, and content-based recommendation engines. All of this can be accomplished by measuring distances between these topic vectors, which we explain later in this chapter.

Topic Vector Representation of a Document Determined by Latent Dirichlet Allocation

Unlike topic vectors using latent semantic analysis, with latent Dirichlet allocation (LDA) we do have to reprocess the entire corpus if we add a new document to the corpus to produce its topic vector. Moreover, we use a nonlinear statistical approach to bundle words into topics: we assume a Dirichlet distribution of word frequencies. This makes the method more precise than latent semantic analysis in terms of the statistics of allocating words to topics. Thus, the method is explainable: the way words are allocated to topics, based on how often they occurred together in a document, and the way topics are allocated to documents, tend to make sense to us as humans.

This nonlinear method takes longer to train than the linear latent semantic analysis. For this reason it is impractical for applications involving corpuses of documents, even though it is explainable. We can use it instead for summarizing single documents, where each sentence in the document becomes its own document, and the mother document becomes the corpus.

LDA was invented in 2000 by geneticists for the purpose of inferring population structure, and adopted in 2003 for natural language processing. The following are its assumptions:

  • We start with raw word counts (rather than normalized TF-IDF vectors), but there is still no sequencing of words to make sense of them. Instead, we still rely on modeling the statistics of words for each document, except this time we incorporate the word distribution explicitly into the model.

  • A document is a linear combination of an arbitrary number of topics (specify this number ahead of time so that the method allocates the document’s tokens to this number of topics).

  • We can represent each topic by a certain distribution of words based on their term frequencies.

  • The probability of occurrence of a certain topic in a document follows a Dirichlet probability distribution.

  • The probability of a certain word being assigned to a topic also follows a Dirichlet probability distribution.

As a result, topic vectors obtained using Dirichlet allocation are sparse, indicating clean separation between the topics in the sense of which words they contain, which makes them explainable.
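The sparsity claim is easy to see by sampling from a symmetric Dirichlet distribution, which the Python standard library allows via the standard gamma-draw construction (the concentration values 0.1 and 10 below are arbitrary illustration choices):

```python
import random

def dirichlet(alpha, k):
    """Sample a length-k probability vector from a symmetric Dirichlet(alpha)
    by normalizing k independent Gamma(alpha, 1) draws."""
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(0)
# Small alpha piles the probability mass onto a few topics (sparse mixtures,
# hence the clean, explainable separation); large alpha spreads it out evenly.
print([round(p, 2) for p in dirichlet(0.1, 10)])
print([round(p, 2) for p in dirichlet(10.0, 10)])
```

With a small concentration parameter, most of a document's probability mass lands on one or two topics, which is exactly the clean separation described above.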

With Dirichlet allocation, words that occur frequently together are assigned to the same topics. So this method keeps tokens that were close together, close together, when we move to the lower-dimensional topic space. Latent semantic analysis, on the other hand, keeps tokens that were spread apart, spread apart, when we move to the lower-dimensional topic space, so this is better for classification problems where the separation between the classes is maintained even as we move to the lower-dimensional space.

Topic Vector Representation of a Document Determined by Latent Discriminant Analysis

Unlike latent semantic analysis and latent Dirichlet allocation, which break down a document into as many topics as we choose, latent discriminant analysis breaks down a document into only one topic, such as spamness, sentiment, etc. This is good for binary classification, such as classifying messages as spam or nonspam, or classifying reviews as positive or negative. Where latent semantic analysis maximizes the separation between all the vectors in the new topic space, latent discriminant analysis maximizes the separation only between the centroids of the vectors belonging to each class.

But how do we determine the vector representing this one topic? Given the TF-IDF vectors of labeled spam and nonspam documents, we compute the centroid of each class, then our vector is along the line connecting the two centroids (see Figure 7-1).

Figure 7-1. Latent discriminant analysis

Each new document can now be projected onto this one dimension. The coordinate of our document along that line is the dot product between its TF-IDF and the direction vector of the centroids line. The whole document (with millions of dimensions) is now squashed into one number along one dimension (one axis) that carries the two centroids along with their midpoint. We can then classify the document as belonging to one class or the other depending on its distance from each centroid along that one line. Note that the decision boundary for separating classes using this method is linear.
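
As a minimal numeric sketch of this procedure, the following uses made-up two-dimensional vectors standing in for real TF-IDF vectors (which would have thousands or millions of dimensions); the data and class names are hypothetical:

```python
# Toy latent discriminant analysis: project documents onto the line
# connecting the two class centroids, then classify by nearest centroid.
# The 2D vectors below are made-up stand-ins for real TF-IDF vectors.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

spam = [[4.0, 1.0], [5.0, 2.0], [6.0, 1.5]]      # hypothetical spam TF-IDF vectors
nonspam = [[1.0, 4.0], [0.5, 5.0], [1.5, 6.0]]   # hypothetical nonspam vectors

c_spam, c_nonspam = centroid(spam), centroid(nonspam)
direction = [a - b for a, b in zip(c_spam, c_nonspam)]  # centroid-to-centroid line

def classify_doc(doc):
    # The document's coordinate along the centroid line is a single number:
    # the dot product of the document vector with the direction vector.
    x = dot(doc, direction)
    dist_spam = abs(x - dot(c_spam, direction))
    dist_nonspam = abs(x - dot(c_nonspam, direction))
    return "spam" if dist_spam < dist_nonspam else "nonspam"
```

A document such as `[5.0, 1.0]`, which sits near the spam cluster, projects closer to the spam centroid and is classified accordingly.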

Meaning Vector Representations of Words and of Documents Determined by Neural Network Embeddings

The previous models for vectorizing documents of natural language text only considered linear relationships between words, or in latent Dirichlet allocation, we had to use human judgment to select the model’s parameters and extract features. We now know that the power of neural networks lies in their ability to capture nonlinear relationships, extract features, and find appropriate model parameters automatically. We will now use neural networks to create vectors that represent individual words and terms, and we will employ similar methods to create vectors representing the meanings of entire paragraphs. Since these vectors encode the meaning and the logical and contextual usage of each term, we can reason with them simply by doing the usual vector additions and subtractions.

Word2vec vector representation of individual terms by incorporating continuous-ness attributes

By using TF vectors or TF-IDF vectors as a starting point for our topic vector models, we have ignored the nearby context of words and the effect that has on their meanings. Word vectors solve this problem. A word vector is a numerical vector representation of a word’s meaning, so every single term in the corpus becomes a vector of semantics. This vector representation with floating-point number entries of single words enables semantic queries and logical reasoning.

Word vector representations are learned using a neural network. They usually have 100 to 500 dimensions encoding how much of each meaning dimension a word carries within it. When training a word vector model, the text data is unlabeled. Once trained, two terms can be determined to be close in meaning or far apart by comparing their vectors via some closeness metrics. Cosine similarity, discussed next, is the go-to method.

In 2013, Google created a word-to-vector model, word2vec, that it trained on the Google News feed containing 100 billion words. The resulting pre-trained word2vec model contains 300-dimensional vectors for 3 million words and phrases. It is freely available to download at the Google Code Archive page for the word2vec project.

The vector that word2vec builds up captures much more of a word’s meaning than the topic vectors discussed earlier in this chapter. The abstract of the paper “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al. 2013) is informative:

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

A month later, the paper “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013) addressed the representation of word phrases that mean something different than their individual components, such as “Air Canada”:

The recently introduced continuous Skip-gram model is an efficient method for learning high quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

The publication that introduced word2vec representations, “Linguistic Regularities in Continuous Space Word Representations” (Mikolov et al. 2013), demonstrates how these meaning vectors for words encode logical regularities and how this enables us to answer regular analogy questions:

Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector space word representations that are implicitly learned by the input layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation specific vector offset. This allows vector oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.” We demonstrate that the word vectors capture syntactic regularities by means of syntactic analogy questions (provided with this paper), and are able to correctly answer almost 40% of the questions. We demonstrate that the word vectors capture semantic regularities by using the vector offset method to answer SemEval-2012 Task 2 questions. Remarkably, this method outperforms the best previous systems.

The performance of word2vec has improved dramatically since 2013 by training it on much larger corpuses.

Word2vec takes one word and assigns to it a vector of attributes, such as place-ness, animal-ness, city-ness, positivity (sentiment), brightness, gender, etc. Each attribute is a dimension, capturing how much of the attribute the meaning of the word contains.

These word meaning vectors and the attributes are not encoded manually, but during training, where the model learns the meaning of a word from the company it keeps: the five or so nearby words in the same sentence. This is different from latent semantic analysis, where the topics are learned only from words occurring in the same document, not necessarily close to each other. For applications involving short documents and statements, word2vec embeddings have actually replaced topic vectors obtained through latent semantic analysis. We can also use word vectors to derive word clusters from huge data sets by performing k-means clustering on top of the word vector representations. See the Google Code Archive page for the word2vec project for more information.

The advantage of representing words through vectors that mean something (rather than count something) is that we can reason with them. For example, as mentioned earlier, if we subtract the vector representing “man” from the vector representing “king” and add the vector representing “woman,” then we get a vector very close to the vector representing the word “queen.” Another example is capturing the relationship between singular and plural words. If we subtract vectors representing the singular form of words from vectors representing their plural forms, we obtain vectors that are roughly the same for all words.
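
A toy sketch of this kind of vector reasoning follows, with tiny hand-made three-dimensional vectors in place of real learned word2vec vectors (whose 300 dimensions carry no such neat human-readable labels):

```python
# Toy illustration of reasoning with word vectors. The 3D vectors below are
# hand-made (dimensions roughly: royalty, gender, something else), not learned.
from math import sqrt

vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
}

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# king - man + woman, computed componentwise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The vocabulary word whose vector is closest (by cosine) to the result
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
```

With these toy vectors, `closest` comes out to "queen", mirroring the famous analogy.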

The next questions are: how do we compute word2vec embeddings? That is, how do we train a word2vec model? What are the training data, the neural network’s architecture, and its inputs and outputs? The neural networks that train word2vec models are shallow, with only one hidden layer. The input is a large corpus of text, and the outputs are vectors of several hundred dimensions, one for each unique term in the corpus. Words that share common linguistic contexts end up with vectors that are close to each other.

There are two learning algorithms for word2vec, which we will not detail here. However, by now we have a very good idea of how neural networks work, especially shallow ones with only one hidden layer. The two learning algorithms are:

Continuous bag-of-words

This predicts the current word from a window of surrounding context words; the order of the context words does not influence the prediction.

Continuous skip-gram

This uses the current word to predict the surrounding window of context words; the algorithm weighs nearby context words more heavily than more distant context words.

Both algorithms learn the vector representation of a term that is useful for prediction of other terms in a sentence. Continuous bag-of-words is apparently faster than continuous skip-gram, while skip-gram is better for infrequent words.
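
As a rough sketch of the training data the skip-gram algorithm consumes, the following generates (center word, context word) pairs from a sentence with a window of two words on each side; real implementations also subsample and weigh these pairs, which we omit here:

```python
# Generate skip-gram style (center, context) training pairs: each word
# is paired with every word within +/- `window` positions of it.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split())
```

For the center word "sat", the pairs include ("sat", "cat") and ("sat", "on"), but "cat" is never paired with "mat", which lies outside its window.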

For more details, refer to the tutorial “The Amazing Power of Word Vectors” (Colyer 2016), the Wikipedia page on word2vec, and the three original papers on the subject: “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al. 2013), “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013), and “Linguistic Regularities in Continuous Space Word Representations” (Mikolov et al. 2013).

The trained Google News word2vec model has 3 million words, each represented with a vector of 300 dimensions. To download this, you would need 3 GB of available memory. There are ways around downloading the whole pre-trained model if we have limited memory or if we only care about a fraction of the words.

How to visualize vectors representing words

Word vectors are very high-dimensional (100–500 dimensions), but humans can only visualize two- and three-dimensional vectors, so we need to project our high-dimensional vectors onto these drastically lower-dimensional spaces and still retain their most essential characteristics. By now we know that the singular value decomposition (principal component analysis) accomplishes that for us, giving us the vectors along which to project in decreasing order of importance, or the directions along which a given collection of word vectors varies the most. That is, the singular value decomposition ensures that this projection gives the best possible view of the word vectors, keeping them as far apart as possible.

There are many nice examples on the web. In the publication “Word Embedding-Topic Distribution Vectors for MOOC (Massive Open Online Courses) video lectures dataset” (Kastrati et al. 2020), the authors use a data set from the education domain with the transcripts of 12,032 video lectures from 200 courses collected from Coursera to generate two things: word vectors using the word2vec model, and document topic vectors using latent Dirichlet allocation. The data set has 878,000 sentences and more than 79,000,000 tokens. The vocabulary size is over 68,000 unique words. The individual video transcripts are of different lengths, varying from 228 to 32,767 tokens, with an average of 6,622 tokens per video transcript. The authors use word2vec and latent Dirichlet allocation implementations in the Gensim package in Python. Figure 7-2 shows the publication’s three-dimensional visualization of a subset of the word vectors using principal component analysis.

Note that word vectors and document topic vectors are not an end unto themselves; instead they are a means to an end, which is usually a natural language processing task, such as: classification within specific domains (such as the massive open online courses in the example), benchmarking and performance analysis of existing and new models, transfer learning, recommendation systems, contextual analysis, enriching short texts with topics, personalized learning, and organizing content so that it is easy to search and maximally visible. We visit such tasks shortly.

Figure 7-2. Three-dimensional visualization of word vectors using the first three principal components. This example highlights the vectors representing the word “learning” and its neighbors: academic, research, institution, reading, etc. (image source)

Facebook’s fastText vector representation of individual n-character grams

Facebook’s fastText is similar to word2vec, but instead of representing full words or n-grams as vectors, it is trained to output a vector representation for every n-character gram. This enables fastText to handle rare, misspelled, and even partial words, such as the ones frequently appearing in social media posts. During training, word2vec’s skip-gram algorithm learns to predict the surrounding context of a given word. Similarly, fastText’s n-character gram algorithm learns to predict a word’s surrounding n-character grams, providing more granularity and flexibility. For example, instead of only representing the full word “lovely” as a vector, it will represent the 2- and 3-grams as vectors as well: lo, lov, ov, ove, ve, vel, el, ely, and ly.
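
A minimal sketch of extracting the 2- and 3-character grams of a word follows (the actual fastText implementation also adds word-boundary markers such as "<" and ">", omitted here):

```python
# Extract the character n-grams of a word, in the spirit of fastText's
# subword units (boundary markers omitted for simplicity).
def char_ngrams(word, sizes=(2, 3)):
    grams = []
    for n in sizes:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

grams = char_ngrams("lovely")
```

For "lovely" this yields exactly the grams listed above: lo, ov, ve, el, ly, lov, ove, vel, and ely.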

Facebook released its pre-trained fastText models for 294 languages, trained on available Wikipedia collections for these languages. These range from Abkhazian to Zulu, and include rare languages spoken by only a handful of people. Of course, the accuracy of the released models varies across languages and depends on the availability and quality of the training data.

Doc2vec or par2vec vector representation of a document

How about representing documents semantically? In previous sections we were able to represent entire documents as topic vectors, but word2vec only represents individual words or phrases as vectors. Can we then extend the word2vec model to represent entire documents as vectors carrying meaning? The paper “Distributed Representations of Sentences and Documents” (Le et al. 2014) does exactly that, with an unsupervised algorithm that learns fixed-length dense vectors from variable-length pieces of texts, such as sentences, paragraphs, and documents. The tutorial “Doc2Vec tutorial using Gensim” (Klintberg 2015) walks through the Python implementation process, producing a fixed-size vector for each full document in a given corpus.

Global vector or vector representation of words

There are other ways to produce vectors representing meanings of words. Global vector or GloVe (2014) is a model that obtains such vectors using the singular value decomposition. It is trained only on the nonzero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another across an entire corpus.

GloVe is essentially a log-bilinear model with a weighted least-squares objective. The log-bilinear model is perhaps the simplest neural language model. Given the preceding n-1 words, the log-bilinear model computes an initial vector representation for the next word simply by linearly combining the vector representations of these preceding n-1 words. The probability of the next word given those n-1 preceding words is then computed from the similarity (dot product) between this linear combination and the representations of all words in the vocabulary:

$$
P(w_n = w \mid w_1, w_2, \ldots, w_{n-1})
  = \frac{\exp\left(\hat{r}^{\top} r_{w}\right)}
         {\exp\left(\hat{r}^{\top} r_{v_1}\right) + \exp\left(\hat{r}^{\top} r_{v_2}\right) + \cdots + \exp\left(\hat{r}^{\top} r_{v_{\text{vocab-size}}}\right)}
$$

where $\hat{r}$ is the linear combination of the vector representations of the preceding $n-1$ words, and $r_{v_i}$ is the vector representation of the $i$-th word in the vocabulary.
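
A minimal numeric sketch of this computation follows, using the average as the linear combination (a real log-bilinear model learns the combination weights) and made-up two-dimensional vectors:

```python
# Sketch of the log-bilinear next-word probability: average the context
# vectors, then take a softmax of dot products over the vocabulary.
from math import exp

def next_word_probs(context_vectors, vocab_vectors):
    # Linearly combine (here: average) the preceding words' vectors.
    n = len(context_vectors)
    dim = len(context_vectors[0])
    combined = [sum(v[i] for v in context_vectors) / n for i in range(dim)]
    # Softmax of dot products against every vocabulary word's vector.
    scores = {w: exp(sum(a * b for a, b in zip(combined, v)))
              for w, v in vocab_vectors.items()}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

probs = next_word_probs(
    [[1.0, 0.0], [0.8, 0.2]],                # made-up context word vectors
    {"cat": [1.0, 0.1], "the": [0.0, 1.0]},  # made-up vocabulary vectors
)
```

The output is a proper probability distribution over the vocabulary; here the combined context vector points mostly along the first dimension, so "cat" receives the larger probability.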

The main intuition underlying the Global Vector model is the simple observation that ratios of word-word co-occurrence probabilities potentially encode some form of meaning. The example on the GloVe project website considers the co-occurrence probabilities for the target words “ice” and “steam” with various probe words from the vocabulary. Table 7-1 shows the actual probabilities from a six-billion-word corpus.

Table 7-1. Co-occurrence probabilities for the target words “ice” and “steam” with the probe words “solid,” “gas,” “water,” and “fashion”

Probability and ratio     k = solid     k = gas       k = water     k = fashion
P(k|ice)                  1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(k|steam)                2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(k|ice)/P(k|steam)       8.9           8.5 × 10⁻²    1.36          0.96

Observing Table 7-1, we notice that, as expected, the word ice co-occurs more frequently with the word solid than it does with the word gas, whereas the word steam co-occurs more frequently with the word gas than it does with the word solid. Both ice and steam co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently. Calculating the ratio of probabilities cancels out the noise from nondiscriminative words like water, so that large values, much greater than 1, correlate well with properties specific to ice, and small values, much less than 1, correlate well with properties specific to steam. This way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of the thermodynamic phase.
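
Recomputing the ratios directly from the rounded probabilities in Table 7-1 (so the numbers differ slightly from the published ratio row) makes the pattern easy to check:

```python
# Co-occurrence probabilities from Table 7-1 (six-billion-word corpus),
# rounded as printed in the table.
p_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

ratios = {k: p_ice[k] / p_steam[k] for k in p_ice}
# Ratios much greater than 1 flag properties specific to ice, ratios much
# less than 1 flag properties specific to steam, and ratios near 1 flag
# nondiscriminative words like "water" and "fashion".
```

The computed ratio for "solid" is large, the ratio for "gas" is small, and the ratios for "water" and "fashion" hover near 1, exactly the pattern the GloVe objective exploits.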

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the probability of co-occurrence of words. Since the logarithm of a ratio is equal to the difference of logarithms, this objective considers vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. For this reason, the resulting word vectors perform very well on word analogy tasks, such as those discussed in the word2vec package.

Since singular value decomposition algorithms have been optimized for decades, GloVe has an advantage in training over word2vec, which is a neural network that relies on gradient descent and backpropagation to perform its error minimization. If we are training our own word vectors on a corpus we care about, we are probably better off using a Global Vector model than word2vec: Global Vector trains faster, uses RAM and CPU more efficiently, and gives more accurate results, even on smaller corpuses, though word2vec was the first model to accomplish semantic and logical reasoning with words.

Cosine Similarity

So far in this chapter we have worked toward one goal only: convert a document of natural language text into a vector of numbers. Our document can be one word, one sentence, a paragraph, multiple paragraphs, or longer. We discovered multiple ways to get our vectors—some are more semantically representative of our documents than others.

Once we have a document’s vector representation, we can feed it into machine learning models, such as classification algorithms, clustering algorithms, or others. One example is to cluster the document vectors of a corpus with some clustering algorithm such as k-means to create a document classifier. We can also determine how semantically similar our document is to other documents, for search engines, information retrieval systems, and other applications.

We have established that due to the curse of dimensionality, measuring the Euclidean distance between two very high-dimensional document vectors is useless, since they would come out extremely far apart, only because of the vastness of the space they inhabit. So how do we determine whether vectors representing documents are close or far, or similar or different? One successful way is to use cosine similarity, measuring the cosine of the angle between the two document vectors. This is given by the dot product of the vectors, each normalized by its length (had we normalized the document vectors ahead of time, then their lengths would have already been one):

$$
\cos\left(\text{angle between } d_1, d_2\right)
  = \frac{d_1^{\top} d_2}{\operatorname{length}(d_1)\,\operatorname{length}(d_2)}
$$

Figure 7-3 shows three documents represented in a two-dimensional vector space. We care about the angles between them.

Figure 7-3. Three documents represented in a two-dimensional vector space

The cosine of an angle is always a number between -1 and 1. When two document vectors are perfectly aligned and pointing in the same direction along all the dimensions, their cosine similarity is 1; when they are perfect opposites of each other regarding every single dimension, their cosine similarity is -1; and when they are orthogonal to each other, their cosine similarity is 0.
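
The cosine similarity just described is only a few lines of code; a sketch:

```python
# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths (Euclidean norms).
from math import sqrt

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    length = lambda v: sqrt(sum(a * a for a in v))
    return dot / (length(d1) * length(d2))
```

Two perfectly aligned vectors give 1, perfectly opposed vectors give -1, and orthogonal vectors give 0, regardless of their lengths.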

Natural Language Processing Applications

The bulk of this chapter has been about converting a given document of natural language text to a vector of numbers. We have established that there are multiple ways to get our document vectors, all leading to varying representations (and hence conclusions), or emphasizing certain aspects of the given natural language data over others. For people entering the natural language processing subfield of AI, this is one of the hardest barriers to overcome, especially if they come from a quantitative background, where the entities they work with are inherently numerical, ripe for mathematical modeling and analysis. Now that we have overcome this barrier, equipped with concrete vector representations for natural language data, we can think mathematically about popular applications. It is important to be aware that there are multiple ways to accomplish each of the following. Traditional approaches are hardcoded rules, assigning scores to words, punctuation, emojis, etc., then relying on the presence of these in a data sample to produce a result. Modern approaches rely on various machine learning models, which in turn rely on (mostly) labeled training data sets. To excel in this field, we must set time aside to try different models on the same task, compare performance, and gain an in-depth understanding of each model along with its strengths, weaknesses, and the mathematical justifications for its successes and failures.

Sentiment Analysis

The following are common approaches for extracting sentiment from natural language text:

Hardcoded rules

A successful algorithm is VADER, or Valence Aware Dictionary for sEntiment Reasoning. A tokenizer here needs to handle punctuation and emojis properly, since these convey a lot of sentiment. We also have to manually compile thousands of words along with their sentiment score, as opposed to having the machine accomplish this automatically.

Naive Bayes classifier

This is a set of classification algorithms based on Bayes’ theorem from probability, with a decision rule that classifies according to maximum likelihood. This will be discussed in Chapter 11.

Latent discriminant analysis

In the previous section, we learned how to classify documents into two classes using latent discriminant analysis. To recap, we start with the data labeled into two classes, then we compute the centroid of each class and find the direction connecting them. We project each new data instance along that direction, and classify it according to which centroid it falls closer to.

Using latent semantic analysis

Clusters of document vectors formed using latent semantic analysis can be used for classification. Ideally, positive reviews cluster away from negative reviews in latent semantic analysis topic spaces. Given a bunch of reviews labeled positive or negative, we first compute their topic vectors using latent semantic analysis. Now, to classify a new review, we can compute its topic vector, then that topic vector’s cosine similarity with the positive and negative topic vectors. Finally, we classify the review as positive if it is more similar to positive topic vectors, and negative if more similar to negative topic vectors.

Transformers, convolutional neural network, recurrent long short-term memory neural network

All of these modern machine learning methods require passing our document in vector form into a neural network with a certain architecture. We will spend time on these state-of-the-art methods shortly.
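
The latent-semantic-analysis approach above can be sketched with made-up two-dimensional topic vectors standing in for real ones computed from labeled reviews:

```python
# Toy sentiment classification in a topic space: compare a new review's
# topic vector to labeled positive and negative topic vectors via cosine.
from math import sqrt

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

positive = [[0.9, 0.1], [0.8, 0.3]]   # made-up topic vectors of positive reviews
negative = [[0.1, 0.9], [0.2, 0.8]]   # made-up topic vectors of negative reviews

def classify_review(review_topic_vector):
    sim_pos = max(cosine(review_topic_vector, v) for v in positive)
    sim_neg = max(cosine(review_topic_vector, v) for v in negative)
    return "positive" if sim_pos > sim_neg else "negative"
```

A new review whose topic vector leans toward the positive cluster, such as `[0.7, 0.2]`, is classified as positive.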

Spam Filter

Mathematically, spam filtering is a classification problem similar to the sentiment analysis discussed previously, where the sentiment of a document is either positive or negative. Thus, the same methods for sentiment classification apply to spam filtering. In all cases, it doesn’t matter how we create our document vectors; we can use them to predict whether a social post is spam or not spam, predict how likely it is to get likes, etc.

Search and Information Retrieval

Again, no matter how we create the numerical vectors representing documents, we can use them for search and information retrieval tasks. The search can be index based or semantic based.

Full text search

This is when we search for documents based on a word or a partial word they contain. Search engines break documents into words that can be indexed, similar to the indexes we find at the end of textbooks. Of course, spelling errors and typos require a lot of tracking and sometimes guessing. Indexes, when available, work pretty well.

Semantic search

Here, our search for documents takes into account the meaning of the words in both our query and in the documents within which we are searching.

The following are common approaches for search and information retrieval:

Based on cosine similarity between the TF-IDF of documents

This is good for corpuses containing billions of documents. Any search engine with a millisecond response time employs an underlying TF-IDF matrix.
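
A minimal sketch of TF-IDF cosine-similarity retrieval (the mini-corpus and query are invented; a real engine would combine this with an inverted index and sparse arithmetic to reach millisecond responses):

```python
import math
from collections import Counter

# Invented mini-corpus
docs = ["the cat sat on the mat",
        "dogs chase the cat",
        "stock markets fell sharply today"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for t in tokenized for w in t})
N = len(docs)

# idf(w) = log(N / number of documents containing w)
df = {w: sum(w in t for t in tokenized) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

def tfidf(tokens):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary tokens are dropped."""
    tf = Counter(tokens)
    return {w: (tf[w] / len(tokens)) * idf[w] for w in tf if w in idf}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc_vectors = [tfidf(t) for t in tokenized]

def search(query):
    q = tfidf(query.split())
    scores = [(cosine(q, dv), d) for dv, d in zip(doc_vectors, docs)]
    return max(scores)[1]            # best-matching document

print(search("cat on a mat"))        # -> "the cat sat on the mat"
```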

Based on semantics

Cosine similarity between topic vectors of documents obtained through latent semantic analysis (for corpuses containing millions of documents) or latent Dirichlet allocation (for much smaller corpuses). This is similar to how we classified whether a message is spam or not spam using latent semantic analysis, except that now we compute the cosine similarity between the new document’s topic vector and all the topic vectors of our database, returning the ones that are most similar to our document.

Based on eigenvector iteration

This has to do with ranking algorithms in search results, such as the PageRank algorithm (which we go over in “Example: PageRank Algorithm”). The following is a useful excerpt from the paper “Role of Ranking Algorithms for Information Retrieval” (Choudhary and Burdak 2012):

There are three important components in a search engine. They are the crawler, the indexer, and the ranking mechanism. The crawler, also called a robot or spider, traverses the web and downloads the web pages. These downloaded pages are sent to an indexing module that parses the web pages and builds the index based on the keywords in those pages. An index is generally maintained using keywords. When a user types a query using keywords into a search engine’s interface, the query processor component matches the query keywords with the index and returns URLs to the user. But before showing the pages to the user, a ranking mechanism is used by the search engine to show the most relevant pages at the top and less relevant ones at the bottom.
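
The ranking mechanism described in the excerpt can be illustrated with a toy power-iteration version of PageRank on an invented four-page link graph (the damping factor 0.85 is the conventional choice):

```python
import numpy as np

# Invented link graph: links[i] = pages that page i links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)

# Column-stochastic link matrix M: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

d = 0.85                      # damping factor
rank = np.ones(n) / n         # start from the uniform distribution
for _ in range(100):          # power iteration on the Google matrix
    rank = (1 - d) / n + d * M @ rank

print(rank.argmax())          # 2: this page collects the most link weight
```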

Semantic search and queries using word vectors (word2vec or GloVe)

Consider a search like this one, adapted from the book Natural Language Processing in Action: “She invented something to do with physics in Europe in the early 20th century.” When we enter our search sentence into Google or Bing, we may not get the direct answer Marie Curie. Google Search will most likely only give us links to lists of famous physicists, both men and women. After searching several pages we find Marie Curie, our answer. Google will take note of that, and refine our results next time we search. Now using word vectors, we can do simple arithmetic on the word vectors representing woman+Europe+physics+scientist+famous, then we would obtain a new vector, close in cosine similarity to the vector representing Marie Curie, and voilà! We have our answer. We can even subtract gender bias in the natural sciences from word vectors by simply subtracting the vector representing the token man, male, etc., so we can search for the word vector closest to: woman+Europe+physics+scientist-male-2*man.

Search based on analogy questions

To compute a search such as They are to music what Marie Curie is to science, all we have to do is simple vector arithmetic of the word vectors representing Marie Curie-science+music.
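
This arithmetic can be sketched with hand-made toy embeddings (the four-dimensional vectors and the vocabulary below are invented for illustration; a real system would load trained word2vec or GloVe vectors with hundreds of dimensions):

```python
import numpy as np

# Hand-made 4-dim "embeddings" (dims loosely: person, music, science, fame);
# invented for illustration only
vecs = {
    "marie_curie": np.array([1.0, 0.0, 1.0, 1.0]),
    "science":     np.array([0.0, 0.0, 1.0, 0.2]),
    "music":       np.array([0.0, 1.0, 0.0, 0.2]),
    "beethoven":   np.array([1.0, 1.0, 0.0, 1.0]),
    "einstein":    np.array([1.0, 0.0, 1.0, 1.0]),
    "guitar":      np.array([0.0, 1.0, 0.0, 0.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, minus, plus, exclude):
    """Answer 'X is to `plus` what `a` is to `minus`' via a - minus + plus."""
    target = vecs[a] - vecs[minus] + vecs[plus]
    candidates = [(cosine(target, v), w) for w, v in vecs.items()
                  if w not in exclude]
    return max(candidates)[1]

# "They are to music what Marie Curie is to science"
print(analogy("marie_curie", "science", "music",
              exclude={"marie_curie", "science", "music"}))  # -> "beethoven"
```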

The following paragraph, regarding indexing and semantic searches, is paraphrased from the book Natural Language Processing in Action:

Traditional indexing approaches work with binary word occurrence vectors, discrete vectors (bag-of-words vectors), sparse floating-point number vectors (TF-IDF vectors), and low-dimensional floating-point number vectors (such as three-dimensional geographic information system data). But high-dimensional floating-point number vectors, such as topic vectors from latent semantic analysis or latent Dirichlet allocation, are challenging. Inverted indexes work for discrete vectors or binary vectors, because the index only needs to maintain an entry for each nonzero discrete dimension. The value of that dimension is either present or not present in the referenced vector or document. Because TF-IDF vectors are sparse, mostly zero, we do not need an entry in our index for most dimensions of most documents. Latent semantic analysis and latent Dirichlet allocation produce topic vectors that are high dimensional, continuous, and dense, where zeros are rare. Moreover, the semantic analysis algorithm does not produce an efficient index for scalable search. This is exacerbated by the curse of dimensionality, which makes an exact index impossible. One solution to the challenge of high-dimensional vectors is to index them with a locality-sensitive hash, like a zip code, that designates a region of hyperspace. Such a hash is similar to a regular hash: it is discrete and depends only on the values in the vector. But even this doesn't work perfectly once we exceed about 12 dimensions. An exact semantic search wouldn't work for a large corpus, such as a Google search or even a Wikipedia semantic search. The key is to settle for good enough rather than striving for a perfect index or a perfect locality-sensitive hashing algorithm for our high-dimensional vectors. There are now several open source implementations of efficient and accurate approximate nearest neighbors algorithms that use locality-sensitive hashing to efficiently implement semantic search. Technically, these indexing or hashing solutions cannot guarantee that we will find all the best matches for our semantic search query. But they can get a good list of close matches almost as fast as a conventional inverted index on a TF-IDF vector or bag-of-words vector, if we are willing to give up a little precision. Neural network models fine-tune the concepts of topic vectors so that the vectors associated with words are more precise and useful, hence enhancing searches.
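
The locality-sensitive hashing idea above can be sketched with random hyperplanes, where each hyperplane contributes one bit of the hash (the dimensions, seed, and vectors below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 16

# Each of the 16 random hyperplanes contributes one bit of the hash:
# which side of the hyperplane the vector falls on
planes = rng.standard_normal((n_planes, dim))

def lsh_hash(v):
    return tuple((planes @ v) > 0)   # 16-bit signature of a region of hyperspace

# Nearby vectors (a small perturbation) should share most hash bits,
# while an unrelated random vector typically shares only about half
v = rng.standard_normal(dim)
near = v + 0.01 * rng.standard_normal(dim)
far = rng.standard_normal(dim)

same_bits_near = sum(a == b for a, b in zip(lsh_hash(v), lsh_hash(near)))
same_bits_far = sum(a == b for a, b in zip(lsh_hash(v), lsh_hash(far)))
print(same_bits_near >= same_bits_far)   # almost always True
```

Grouping vectors by their signatures buckets nearby vectors together, so an approximate nearest-neighbor search only needs to scan a few buckets instead of the whole corpus.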

Machine Translation

The goal here is to translate a sequence of tokens of any length (such as a sentence or a paragraph) into a sequence of any length in a different language. The encoder-decoder architecture, discussed in the context of transformers and recurrent neural networks, has proven successful for translation tasks. The encoder-decoder architecture is different from the autoencoder architecture.

Image Captioning

This combines computer vision with natural language processing.

Chatbots

This is the ultimate application of natural language processing. A chatbot requires more than one kind of processing: parse language, search, analyze, generate responses, respond to requests, and execute them. Moreover, it requires a database to maintain a memory of past statements and responses.

Other Applications

Other applications include named-entity recognition, conceptual focus, relevant information extraction from text (such as dates), and language generation, which we visit in Chapter 8.

Transformers and Attention Models

Transformers and attention models are the state-of-the-art for natural language processing applications such as machine translation, question answering, language generation, named-entity recognition, image captioning, and chatbots (as of 2022). Currently, they underlie large language models such as Google’s BERT (Bidirectional Encoder Representations from Transformers) and OpenAI’s GPT-2 (Generative Pre-trained Transformer) and GPT-3.

Transformers bypass both recurrence and convolution architectures, which were the go-to architectures for natural language processing applications up until 2017, when the paper “Attention Is All You Need” (Vaswani et al. 2017) introduced the first transformer model.

The dethroned recurrent and convolutional neural network architectures are still in use (and work well) for certain natural language processing applications, as well as other applications such as finance. We elaborate on these models later in this chapter. However, the reasons that led to abandoning them for natural language are:

  • For short input sequences of natural language tokens, the attention layers that are involved in transformer models are faster than recurrent layers. Even for long sequences, we can modify attention layers to focus on only certain neighborhoods within the input.

  • The number of sequential operations required by a recurrent layer depends on the length of the input sequence. This number stays constant for an attention layer.

  • In convolutional neural networks, the width of the kernel directly affects the long-term dependencies between pairs of inputs and corresponding outputs. Tracking long-term dependencies then requires large kernels, or stacks of convolutional layers, all increasing the computational cost of the natural language model employing them.

The Transformer Architecture

Transformers are an integral part of enormous language models, such as GPT-2, GPT-3, Google’s BERT (which trains the language model by looking at the sequential text data from both left to right and right to left) and Wu Dao’s transformer. These models are massive: GPT-2 has around 1.5 billion parameters trained on millions of documents, drawn from 8 million websites from all around the internet. GPT-3 has 175 billion parameters trained on an even larger data set. Wu Dao’s transformer has a whopping 1.75 trillion parameters, consuming even more computational resources for training and inference.

Transformers were originally designed for language translation tasks, so they have an encoder-decoder structure. Figure 7-4 illustrates the architecture of the transformer model originally introduced by the paper “Attention Is All You Need” (Vaswani et al. 2017). However, each encoder and decoder is its own module, so they can be used separately to perform various tasks. For example, we can use the encoder alone to perform a classification task such as part-of-speech tagging, meaning we input the sentence I love cooking in my kitchen, and the output will be a class for each word: I: pronoun; love: verb; cooking: noun, etc.

The input to the full transformer model (with both the encoder and decoder included) is a sequence of natural language tokens of any length, such as a question to a chatbot, a paragraph in English that requires translation into French, or a passage to be summarized into a headline. The output is another sequence of natural language tokens, also of any length, such as the chatbot’s answer, the translated paragraph in French, or the headline.

Do not confuse the training phase with the inference phase of a model:

During training

The model is fed both the data and the labels, such as an English sentence (input data sample) along with its French translation (label), and the model learns a mapping from the input to the target label that generalizes well to hopefully the entire vocabularies and grammars of both languages.

During inference

The model is fed only the English sentence, and outputs its French translation. Transformers output the French sentence one new token at a time.

Figure 7-4. The simple encoder-decoder architecture of the transformer model (image source)

The encoder, on the left half of the transformer architecture (see Figure 7-4), receives an input of tokens, such as an English sentence, How was your day?, and produces multiple numerical vector representations for each token of this sentence, encoding the token’s contextual information from within the sentence. The decoder part of the architecture receives these vectors as its input.

The decoder, on the right half of the architecture, receives the vector output of the encoder together with the decoder’s output at the previous time step. Ultimately, it generates an output of tokens, such as the French translation of the input sentence, Comment se passe ta journée (see Figure 7-5). What the decoder actually computes is a probability for each word in the French vocabulary (say, 50,000 tokens) using a softmax function, then produces the token with the highest probability. In fact, since computing a softmax for such a high-dimensional vocabulary is expensive, the decoder uses a sampled softmax, which computes the probability for each token in a random sample of the French vocabulary at each step. During training, it has to include the target token in this sample, but during inference, there is no target token.

Transformers use a process called attention to capture long-term dependencies in sequences of tokens. The word sequence is confusing here, especially for mathematicians, who have clear distinctions among the terms sequence, series, vector, and list. Sequences are usually processed one term at a time, meaning one term is processed, then the next, then the next, and so on, until the whole input is consumed. Transformers do not process input tokens sequentially. They process them all together, in parallel. This is different from the way recurrent neural networks process input tokens, which have to be fed sequentially, in effect prohibiting parallel computation. If it were up to us to correct this terminology, we would call a natural language sentence a vector if we are processing it using a transformer model, or a matrix, since each word in the sentence is represented as its own vector, or a tensor if we process a batch of sentences at a time, which the architecture of the transformer allows. If we want to process the same exact sentence using a recurrent neural network model, then we should call it a sequence, since this model consumes its input data sequentially, one token at a time. If we process it using a convolutional neural network, then we would call it a vector (or matrix) again, since the network consumes it as a whole, not broken down into one token at a time.

It is an advantage when a model does not need to consume the input sequentially, because such architectures allow for parallel processing. That said, even though parallelization makes transformers computationally efficient, they cannot take full advantage of the inherent sequential nature of the natural language input and the information encoded within this sequentiality. Think of how humans process text. There are new transformer models that try to leverage this.

The transformer model runs as follows:

  1. Represent each word from the input sequence as a d-dimensional vector.

  2. Incorporate the order of words into the model by adding to each word vector information about its position (positional encoding). Each word vector is accompanied by a positional encoding vector of the same dimension, which allows the two vectors to be added together. There are many choices of positional encodings: some are learned during training, others are fixed. Discretized sine and cosine functions with varying frequencies are common.

  3. Next, feed the positionally encoded word vectors to the encoder block. The encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, thus the transformer encoder is bidirectional.

  4. The decoder receives as input its own predicted output word at time step t–1, along with the output vectors of the encoder.

  5. The input to the decoder is also augmented by positional encoding.

  6. The augmented decoder input is fed into three sublayers. The decoder cannot attend to succeeding words, so we apply masking in its first sublayer. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all of the words in the input sequence.

  7. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
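
The fixed sinusoidal positional encoding mentioned in the steps above can be computed as in the original "Attention Is All You Need" formulation; a NumPy sketch with a toy sequence length and model dimension:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cosines
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)        # (6, 8): one encoding vector per position
print(pe[0])           # position 0: all sines are 0, all cosines are 1
```

Because each dimension oscillates at a different frequency, every position receives a distinct vector, and these vectors are simply added to the word embeddings.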

The Attention Mechanism

The transformer’s magic is largely due to built-in attention mechanisms. An attention mechanism comes with bonuses:

Explainability

Pointing out which parts of the input sentence (or document) the model paid attention to when producing a particular output (see Figure 7-5).

Leveraging pre-trained attention models

We can adapt pre-trained models to domain-specific tasks. That is, we can further tweak their parameter values with extra training on domain-specific data.

More accurate modeling of longer sentences

Another value of attention mechanisms is that they allow the modeling of dependencies in sequences of natural language tokens without regard to how far apart related tokens occur in these sequences.

Figure 7-5 illustrates attention for a translation task from English to French.

Figure 7-5. Illustrating attention via a translation task: the weights assigned to the input tokens show which tokens the model attends to more in order to generate each output token (image source)

There is no hardcore mathematics involved in an attention mechanism: we only have to compute a scaled dot product. The main goal of attention is to highlight the most relevant parts of the input sequence, how strongly they relate to each other within the input itself, and how strongly they contribute to certain parts of the output.

Self attention is when a sequence of vectors computes alignment within its own members. We are now familiar with the fact that the dot product measures the compatibility between two vectors. We can compute the simplest possible self attention weights by finding the dot products between all the members of the sequence of vectors. For example, for the sentence I love cooking in my kitchen, we would compute all the dot products between the word vectors representing the words I, love, cooking, in, my, and kitchen. We would expect the dot product between I and my to be high, similarly between cooking and kitchen. However, the dot product will be highest between I and I, love and love, etc., because these vectors are perfectly aligned with themselves, but there is no valuable information gleaned there. The transformer’s solution to avoiding this waste is multifold:

  1. Apply three different transformations to each vector of the input sequence (each word of the sentence), multiplying them by three different weight matrices. We then obtain three different sets of vectors corresponding to each input word vector \(\vec{w}\):

    • The query vector \(query = W_q \vec{w}\), the vector attended from.

    • The key vector \(key = W_k \vec{w}\), the vector attended to.

    • The value vector \(value = W_v \vec{w}\), to capture the context that is being generated.

  2. Obtain alignment scores between the query and key vectors for all words in the sentence by computing their dot product scaled by the inverse of the square root of the length \(l\) of these vectors. We apply this scaling for numerical stability, to keep the dot products from becoming large. (These dot products will soon be passed into a softmax function. Since the softmax function has a very small gradient when its input has a large magnitude, we offset this effect by dividing each dot product by \(\sqrt{l}\).) Moreover, the alignment of two vectors is independent of the lengths of these vectors. Therefore, the alignment score between cooking and kitchen in our sentence will be:

    \(alignment_{cooking,kitchen} = \frac{1}{\sqrt{l}} \, query_{cooking}^{t} \, key_{kitchen}\)

    Note that this will be different from the alignment score between kitchen and cooking, since the query and key vectors for each are different. Thus, the resulting alignment matrix is not symmetric.

  3. Transform each alignment score between each pair of words in the sentence into a probability by passing the score into the softmax function. For example:

    \(\omega_{cooking,kitchen} = softmax(alignment_{cooking,kitchen}) = \frac{\exp(alignment_{cooking,kitchen})}{\exp(alignment_{cooking,I}) + \exp(alignment_{cooking,love}) + \exp(alignment_{cooking,cooking}) + \exp(alignment_{cooking,in}) + \exp(alignment_{cooking,my}) + \exp(alignment_{cooking,kitchen})}\)

  4. Finally, encode the context of each word by linearly combining the value vectors, using the alignment probabilities as weights for the linear combination. For example:

    \(context_{cooking} = \omega_{cooking,I} \, value_{I} + \omega_{cooking,love} \, value_{love} + \omega_{cooking,cooking} \, value_{cooking} + \omega_{cooking,in} \, value_{in} + \omega_{cooking,my} \, value_{my} + \omega_{cooking,kitchen} \, value_{kitchen}\)

We have thus managed to capture in one vector the context of each word in the given sentence, with high weight assigned to the words in the sentence that it aligns with most.

The good news here is that we can compute the context vector for all the words of a sentence (data sample) simultaneously, since we can pack the vectors previously mentioned in matrices and use efficient and parallel matrix computations to get the contexts for all the terms at once.
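
In matrix form, the per-word computations above collapse into three matrix products and a row-wise softmax. A NumPy sketch with invented dimensions and untrained random weight matrices (in a real model, the weight matrices are learned during training):

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, d = 6, 8        # e.g., "I love cooking in my kitchen", d-dim embeddings

X = rng.standard_normal((n_tokens, d))      # one embedding per row (toy values)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # queries, keys, values for all tokens

scores = Q @ K.T / np.sqrt(d)               # alignment scores, scaled by 1/sqrt(d)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax

context = weights @ V                        # one context vector per token
print(context.shape)                         # (6, 8)
print(np.allclose(weights.sum(axis=1), 1.0)) # True: each row is a probability dist.
```

All six context vectors come out of a handful of matrix multiplications, which is exactly the parallelism that makes transformers efficient on modern hardware.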

We implement all of the above in one attention head. That is, one attention head produces one context vector for each token in the data sample. We would benefit from producing multiple context vectors for the same token, since some information gets lost in all the averaging that happens on the way to a context vector. The idea here is to be able to extract information using different representations of the terms of a sentence (data sample), as opposed to a single representation corresponding to a single attention head. So we implement multihead attention, choosing new transformation matrices \(W_q\), \(W_k\), and \(W_v\) for each head.

Note that during the training process, the entries of the transformation matrices are model parameters that have to be learned from the training data samples. Imagine then how fast the number of the model’s parameters will balloon.

Figure 7-6 illustrates the multihead attention mechanism, implementing h heads that receive different linearly transformed versions of the queries, keys, and values to produce h context vectors for each token, which are then concatenated to produce the output of the multihead attention part of the model’s structure.
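
The multihead wiring can be sketched by giving each head its own (here random, untrained) transformation matrices and concatenating the resulting context vectors; the dimensions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, h = 6, 16, 4
d_head = d_model // h                       # per-head dimension

X = rng.standard_normal((n_tokens, d_model))

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)       # row-wise softmax
    return w @ V                            # (n_tokens, d_head)

heads = []
for _ in range(h):                          # fresh transformation matrices per head
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

multihead = np.concatenate(heads, axis=1)   # concatenated head outputs
print(multihead.shape)                      # (6, 16)
```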

The decoder uses a similar self attention mechanism, but here each word can only attend to the words before it, since text is generated from left to right. Moreover, the decoder has an extra attention mechanism attending to the outputs it receives from the encoder.

Figure 7-6. The multihead attention mechanism (image source)

Transformers Are Far from Perfect

Even though transformer models have revolutionized the natural language processing field, they are far from perfect. Language models, in general, are mindless mimics. They understand neither their inputs nor their outputs. Critical articles, such as this and this by the MIT Technology Review, among others, detail their shortcomings, such as lack of comprehension of language, repetition when used to generate long passages of text, and so on. That said, the transformer model brought about a tidal wave for natural language, and it is making its way to other AI fields, such as biomedicine, computer vision, and image generation.

Convolutional Neural Networks for Time Series Data

The term time series in the natural language processing and finance domains should be time sequence instead. Series in mathematics refers to adding up the terms of an infinite sequence. So when our data is not summed, which is the case for all natural language data and most finance data, we actually have sequences, not series, of numbers, vectors, etc. Oh well, vocabulary collisions are unavoidable, even across different fields that heavily rely on one another.

Other than the definition of a word in a dictionary, the meanings of words are mostly correlated with the way the words occur relative to each other. This is conveyed through the way words are ordered in sentences, as well as their context and proximity to other words in sentences.

We first emphasize the two ways in which we can explore the meanings behind words and terms in documents:

Spatially

Exploring a sentence all at once as one vector of tokens, whichever way these tokens are represented mathematically.

Temporally

Exploring a sentence sequentially, one token at a time.

Convolutional neural networks, discussed in Chapter 5, explore sentences spatially, by sliding a fixed-width window (kernel or filter) along the tokens of the sentence. When using convolutional neural networks to analyze text data, the network expects an input of fixed dimensions. On the other hand, when using recurrent neural networks (discussed next) to analyze text data, the network expects tokens sequentially, hence, the input does not need to be of fixed length.

In Chapter 5, we were sliding two-dimensional windows (kernels or filters) over images, and in this chapter, we will slide one-dimensional kernels over text tokens. We now know that each token is represented as a vector of numbers. We can either use one-hot encoding or word vectors of the word2vec model. One-hot encoded tokens are represented with a very long vector that has a 0 for every possible vocabulary word that we want to include from the corpus, and a 1 in the position of the token we are encoding. Alternatively, we can use trained word vectors produced via word2vec. Thus, an input data sample to the convolutional neural network is a matrix made up of column vectors, one column for each token in the data sample. If we use word2vec to represent tokens, then each column vector would have 100 to 500 entries, depending on the particular word2vec model used. Recall that for a convolutional neural network, each data sample has to have the exact same number of tokens.
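
To make the shape of this input concrete, here is a minimal sketch in numpy (the five-word vocabulary and the random vectors standing in for trained word2vec embeddings are assumptions for illustration):

```python
import numpy as np

vocab = ["the", "movie", "was", "great", "fun"]  # toy vocabulary (assumption)

def one_hot(token):
    # A vector with a 1 in the token's vocabulary position, 0 elsewhere
    v = np.zeros(len(vocab))
    v[vocab.index(token)] = 1.0
    return v

rng = np.random.default_rng(0)
# Stand-in for trained word2vec vectors: 100 entries per word
word2vec = {w: rng.normal(size=100) for w in vocab}

sample = ["the", "movie", "was", "great"]
# One column per token: 5x4 for one-hot, 100x4 for word2vec
X_onehot = np.column_stack([one_hot(t) for t in sample])
X_w2v = np.column_stack([word2vec[t] for t in sample])
```

Either way, the data sample ends up as a matrix with one column per token, which is the shape the convolutional layer expects.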

Therefore, one data sample (a sentence or a paragraph) is represented with a two-dimensional matrix, where the number of rows is the full length of the word vector. In this context, saying that we are sliding a one-dimensional kernel over our data sample is slightly misleading, but here is the explanation. The vector representation of the sample’s tokens extends downward; however, the filter covers the whole length of that dimension all at once. That is, if the filter is three tokens wide, it would be a matrix of weights with three columns and as many rows as the vector representation of our tokens. Thus, one-dimensional convolution here refers to convolving only horizontally. This is different than two-dimensional convolution for images, where the two-dimensional filter travels across the image both horizontally and vertically.
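
A minimal numpy sketch of this one-dimensional convolution (the embedding length, sample length, and random filter weights are assumed for illustration); note that the filter spans the full height of the matrix and slides only horizontally:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 100, 7                     # embedding length, tokens per sample (assumed)
X = rng.normal(size=(d, n))       # one data sample: one column per token
W = rng.normal(size=(d, 3))       # filter: 3 tokens wide, full embedding height

# "One-dimensional" convolution: the filter slides horizontally only,
# producing a single number at each of the n - 3 + 1 positions
out = np.array([np.sum(W * X[:, i:i+3]) for i in range(n - 3 + 1)])
```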

As in Chapter 5, during a forward pass, the weight values in the filters are the same for one data sample. This means that we can parallelize the process, which is why convolutional neural networks are efficient to train.

Recall that convolutional neural networks can also process more than one channel of input at the same time, that is, three-dimensional tensors of input, not only two-dimensional matrices of numbers. For images this was processing the red, green, and blue channels of an input image all at once. For natural language, one input sample is a bunch of words represented as column vectors lined up next to each other. We now know that there are multiple ways to represent the same word as a vector of numbers, each perhaps capturing different semantics of the same word. These different vector representations of the same word are not necessarily of the same length. If we restrict them to be of the same length, then each representation can be a word’s channel, and the convolutional neural network can process all the channels of the same data sample at once.

As in Chapter 5, convolutional neural networks are efficient due to weight sharing, pooling layers, dropout, and small filter sizes. We can run the model with multiple size filters, then concatenate the output of each size filter into a longer thought vector before passing it into the fully connected last layer. Of course, the last layer of the network accomplishes the desired task, such as sentiment classification, spam filtering, text generation, and others. We went over this in Chapters 5 and 6.
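
Sketching the multiple-filter-size idea in numpy (the filter counts and widths are assumptions; a real network would learn the filter weights and pass the concatenated vector to a fully connected layer):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 100, 10                    # embedding length, tokens per sample (assumed)
X = rng.normal(size=(d, n))

def conv_and_pool(X, width, num_filters=4):
    # Slide num_filters filters of the given width, then max-pool over positions
    filters = rng.normal(size=(num_filters, d, width))
    positions = X.shape[1] - width + 1
    feats = np.array([[np.sum(f * X[:, i:i+width]) for i in range(positions)]
                      for f in filters])
    return feats.max(axis=1)      # one number per filter (global max pooling)

# Concatenate the pooled outputs of width-3, -4, and -5 filters
thought_vector = np.concatenate([conv_and_pool(X, w) for w in (3, 4, 5)])
```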

Recurrent Neural Networks for Time Series Data

Consider the following three sentences:

  • She bought tickets to watch the movie.

  • She, having free time, bought tickets to watch the movie.

  • She, having heard about it nonstop for two weeks in a row, finally bought tickets to watch the movie.

In all three sentences, the predicate bought tickets to watch the movie corresponds to the sentence’s subject She. A natural language model will be able to learn this if it is designed to handle long-term dependencies. Let’s explore how different models handle such long-term dependencies:

Convolutional neural networks and long-term dependencies

A convolutional neural network, with its narrow filtering window of three to five tokens scanning the sentence, will be able to learn from the first sentence easily, and maybe the second sentence, given that the predicate’s position changed only a little bit (pooling layers help with the network’s resistance to small variations). The third sentence will be tough, unless we use larger filters (which increases the computation cost and makes the network more like a fully connected network than a convolutional network), or unless we deepen the network, stacking convolutional layers on top of each other so that the coverage widens as the sentence makes its way deeper into the network.

Recurrent neural networks with memory units

A completely different approach is to feed the sentence into the network sequentially, one token at a time, and maintain a state and a memory that hold on to important information for a certain amount of time. The network produces an outcome when all the tokens in the sentence have passed through it. If this is during training, only the outcome produced after the last token has been processed gets compared to the sentence’s label, then the error backpropagates through time, to adjust the weights. Compare this to the way we hold on to information when reading a long sentence or paragraph. Recurrent neural networks with long short-term memory units are designed this way.

Transformer models and long-term dependencies

Transformer models, which we discussed earlier, abolish both convolution and recurrence, relying only on attention to capture the relationship between the subject of the sentence She, and the predicate bought tickets to watch the movie.

One more thing differentiates recurrence models from convolutional and transformer models: does the model expect its input to be of the same length for all data samples? Can we only input sentences of the same length? The answer for transformers and convolutional networks is that they expect only fixed-length data samples, so we have to preprocess our samples and make them all the same length. On the other hand, recurrent neural networks handle variable length inputs really well, since after all they only take them one token at a time.
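
For the models that need fixed-length inputs, preprocessing usually pads short samples and truncates long ones. A minimal sketch (the `<pad>` placeholder token is a common convention, assumed here):

```python
def pad_or_truncate(tokens, length, pad="<pad>"):
    # Make every sample exactly `length` tokens, as convolutional and
    # transformer models expect; recurrent networks can skip this step
    return (tokens + [pad] * length)[:length]

samples = [["good", "movie"], ["a", "very", "long", "glowing", "review"]]
fixed = [pad_or_truncate(s, 4) for s in samples]
# The short sample gets padded; the long one gets truncated to 4 tokens
```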

The main idea for a recurrent neural network is that it holds on to past information as it processes new information. How does this holding happen? In a feed forward network, the output of a neuron leaves it and never gets back to it. In a recurrent network, the output loops back into the neuron, along with new input, in essence creating a memory. Such algorithms are great for autocompletion and grammar check. They have been integrated into Gmail’s Smart Compose since 2018.

How Do Recurrent Neural Networks Work?

Here are the steps for how a recurrent neural network gets trained on a set of labeled data samples. Each data sample is made up of a bunch of tokens and a label. As always, the goal of the network is to learn the general features and patterns within the data that end up producing a certain label (or output) versus others. When tokens of each sample are input sequentially, our goal is then to detect the features, across all the data samples, that emerge when certain tokens appear in patterns relative to each other:

  1. Grab one tokenized and labeled data sample from your data set (such as a movie review labeled positive, or a tweet labeled fake news).

  2. Pass the first token token_0 of your sample into the network. Remember that tokens are vectorized, so you are really passing a vector of numbers into the network. In mathematical terms, we are evaluating a function at that token’s vector and producing another vector. So far, our network has calculated f(token_0).

  3. Now pass the second token token_1 of your sample into the network, along with the output of the first token, f(token_0). The network now will evaluate f(token_1 + f(token_0)). This is the recurrence step, and this is how the network does not forget token_0 as it processes token_1.

  4. Now pass the third token token_2 of your sample into the network, along with the output of the previous step, f(token_1 + f(token_0)). The network now will evaluate f(token_2 + f(token_1 + f(token_0))).

  5. Keep going until you finish all the tokens of your one sample. Suppose this sample only had five tokens, then our recurrent network will output f(token_4 + f(token_3 + f(token_2 + f(token_1 + f(token_0))))). Note that this output looks very similar to the output of the feed forward fully connected networks that we discussed in Chapter 4, except that this output unfolds through time as we input a sample’s tokens one at a time into one recurrent neuron, while in Chapter 4, the network’s output unfolds through space, as one data sample moves from one layer of the neural network to the next. Mathematically, when we write the formulas of each, they are the same, so we don’t need more math beyond what we learned in the past three chapters. That’s why we love math.

  6. When training the network to produce the right thing, it is the final output of the sample, f(token_4 + f(token_3 + f(token_2 + f(token_1 + f(token_0))))), that gets compared against the sample’s true label via evaluating a loss function, exactly as we did in Chapters 3, 4, and 5.

  7. Now pass the next data sample one token at a time into the network and do the same thing again.

  8. We update the network’s weights in exactly the same way we updated them in Chapter 4, by minimizing the loss function via a gradient descent-based algorithm, where we calculate the required gradient (the derivatives with respect to all the network’s weights) via backpropagation. As we just said, this is exactly the same backpropagation mathematics we learned in Chapter 4, except now, of course, we get to say we’re backpropagating through time.
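
The training steps above can be sketched as a single recurrent neuron in numpy (the tanh nonlinearity and the small random weight matrix are assumptions standing in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4                                             # token-vector length (assumed)
tokens = [rng.normal(size=d) for _ in range(5)]   # one 5-token sample

W = rng.normal(size=(d, d)) * 0.1    # shared weights, reused at every time step

def f(v):
    # The recurrent neuron: a linear map plus a nonlinearity
    return np.tanh(W @ v)

# Unroll through time: f(token_4 + f(token_3 + f(token_2 + ...)))
state = f(tokens[0])
for t in tokens[1:]:
    state = f(t + state)
# `state` is the final output that gets compared against the sample's label
```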

In finance, dynamics, and feedback control, this process is called an autoregressive moving average (ARMA) model.
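
As a hedged illustration of the ARMA idea, here is a simulated ARMA(1,1) process in numpy, where each value depends on the previous value (autoregression) plus the current shock and a moving average of the previous shock (the coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
phi, theta = 0.6, 0.3     # illustrative AR and MA coefficients (assumptions)
T = 200

eps = rng.normal(size=T)  # white-noise shocks
x = np.zeros(T)
for t in range(1, T):
    # ARMA(1,1): today's value = autoregression on yesterday's value
    #            + today's shock + a weighted copy of yesterday's shock
    x[t] = phi * x[t-1] + eps[t] + theta * eps[t-1]
```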

Training a recurrent neural net can be expensive, especially for data samples of any significant length, say 10 tokens or more, since the number of weights to learn is directly related to the number of tokens in data samples: the more tokens, the more depth in time the recurrent network has. Other than the computational cost, this depth comes with all the troubles encountered by regular feed forward networks with many layers: vanishing or exploding gradients, especially with samples of data with hundreds of tokens, which will be the mathematical equivalent of a fully connected feed forward neural network with hundreds of layers! The same remedies for exploding and vanishing gradients for feed forward networks work here.

Gated Recurrent Units and Long Short-Term Memory Units

Recurrent neurons in recurrent networks are not enough to capture long-term dependencies in a sentence. A token’s effect gets diluted and stepped on by the new information as more tokens pass through the recurrent neuron. In fact, a token’s information is almost completely gone after only two tokens have passed. This problem can be addressed if we add memory units, called long short-term memory units, to the architecture of the network. These help the network learn dependencies that stretch across a whole data sample.

Long short-term memory units contain neural networks, and they can be trained to find only the new information that needs to be retained for the upcoming input, and to forget, or reset to zero, information that is no longer relevant to learning. Therefore, long short-term memory units learn which information to hold on to, while the rest of the network learns to predict the target label.

There is no new mathematics beyond what we learned in Chapter 4 here, so we will not go into the weeds digging into the specific architecture of a long short-term memory unit, or a gated unit. In summary, the input token for each time step passes through the forget and update gates (functions), gets multiplied by weights and masks, then gets stored in a memory cell. The network’s next output depends on a combination of the input token and the memory unit’s current state. Moreover, long short-term memory units share the weights they learned across samples, so they do not have to relearn basic information about language as they go through each sample’s tokens.
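
For readers who do want a peek into the weeds, here is a minimal numpy sketch of one long short-term memory step following the standard gate formulation summarized above (the random weights stand in for learned ones, and the input and hidden sizes are assumed equal for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4                                   # hidden and input size (assumed equal)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [input; previous hidden state]
Wf, Wi, Wo, Wc = (rng.normal(size=(d, 2 * d)) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                    # forget gate: what to erase
    i = sigmoid(Wi @ z)                    # update gate: what new info to store
    o = sigmoid(Wo @ z)                    # output gate: what to expose
    c = f * c_prev + i * np.tanh(Wc @ z)   # memory cell update
    h = o * np.tanh(c)                     # next hidden state (the output)
    return h, c

h, c = np.zeros(d), np.zeros(d)
for x in [rng.normal(size=d) for _ in range(3)]:
    h, c = lstm_step(x, h, c)
```

The same weight matrices are reused at every step, which is the weight sharing across a sample that the text describes.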

Humans are able to process language on a subconscious level, and long short-term memory units are a step into modeling that. They are able to detect patterns in language that allow us to address more complex tasks than mere classification, such as language generation. We can generate novel text from learned probability distributions. This is the topic of Chapter 8.

An Example of Natural Language Data

When faced with narratives about different models, it always makes things easier to have specific examples in mind with real data, along with the models’ hyperparameters. We can find the IMDb movie review data set at the Stanford AI website. Each data sample is labeled with a 0 (negative review) or a 1 (positive review). We can start with the raw text data if we want to practice preprocessing natural language text. Then we can tokenize it and vectorize it using one-hot encoding over a chosen vocabulary, the Google word2vec model, or some other model. Do not forget to split the data into training and test sets. Then choose the hyperparameters: for example, word vectors of length around 300, around 400 tokens per data sample, mini-batches of 32, and 2 epochs. We can play around with these to get a feel for the models’ performance.
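
A toy version of this preprocessing pipeline in numpy (the two "reviews" are made-up stand-ins for the IMDb data, and padding with -1 is an assumed placeholder index):

```python
import numpy as np

# Two toy "reviews" standing in for the IMDb data (assumption)
reviews = [("loved this movie", 1), ("boring and too long", 0)]

vocab = sorted({w for text, _ in reviews for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

max_tokens = 400   # tokens per data sample, as in the text

def encode(text):
    # Map words to vocabulary indices, then pad to a fixed length
    # (a real pipeline would swap indices for ~300-entry word vectors)
    ids = [index[w] for w in text.split()][:max_tokens]
    return ids + [-1] * (max_tokens - len(ids))

X = np.array([encode(text) for text, _ in reviews])
y = np.array([label for _, label in reviews])
```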

Finance AI

AI models have wide use in the finance field. By now, we know the underlying structure of most AI models (except for graphs, which have a different mathematical structure; we discuss graph networks in Chapter 9). At this point, only mentioning an application area from finance is enough for us to have a very good idea about how to go about modeling it using AI. Moreover, many finance applications are naturally intertwined with natural language processing applications, such as marketing decisions based on customer reviews, or a natural language processing system used to predict economic trends and trigger large financial transactions based only on the models’ outputs.

The following are only two AI applications in finance, among many. Think of ways we can put what we have learned so far to good use modeling these problems:

  • Stock market time series prediction. A recurrent neural network can take a sequence of inputs and produce a sequence of outputs. This is useful for the time series prediction required for stock prices. We input the prices over the past n days, and the network outputs the prices from the past n-1 days along with tomorrow’s price.

  • Autoregressive moving average (ARMA) model in finance, dynamics, and feedback control.
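
A minimal sketch of preparing stock prices for the sequence-to-sequence setup in the first bullet (the prices and window length are made up for illustration):

```python
import numpy as np

prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 103.5])  # made-up prices

n = 3  # look-back window of n days (assumption)
# Each input is n consecutive prices; the target is those prices shifted
# by one day, so the last target entry plays the role of "tomorrow's" price
X = np.array([prices[i:i+n] for i in range(len(prices) - n)])
Y = np.array([prices[i+1:i+n+1] for i in range(len(prices) - n)])
```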

The stock market appears multiple times in this book. Keep an eye out for it when we are discussing stochastic processes in Chapter 11 on probability.

Summary and Looking Ahead

There was almost no new math in this chapter; however, it was one of the hardest to write. The goal was to summarize the most important ideas in the whole natural language processing field. Moving from words to relatively low-dimensional vectors of numbers that carry meaning was the main barrier to overcome. Once we learned multiple ways to do this, whether vectorizing one word at a time or the main topics in a long document or an entire corpus, feeding those vectors to different machine learning models with different architectures and purposes was just business as usual.

Calculus

The log scale for term frequencies and inverse document frequencies

Statistics
  • Zipf’s law for word counts

  • The Dirichlet probability distribution for assigning words to topics and topics to documents

Linear algebra
  • Vectorizing documents of natural language

  • The dot product of two vectors and how it provides a measure of similarity or compatibility between the entities that the vectors represent

  • Cosine similarity

  • Singular value decomposition, i.e., latent semantic analysis

Probability
  • Conditional probabilities

  • Bilinear log model

Time series data
  • What it means

  • How it is fed into machine learning models (as one bulk, or one item at a time)

AI model of the day
  • The transformer

Chapter 8. Probabilistic Generative Models

AI ties up all the math that I know together, and I have been getting to know math for years.

H.

If machines are ever to be endowed with an understanding of the world around them, and an ability to recreate it, like we do when we imagine, dream, draw, create songs, watch movies, or write books, then generative models are one significant step in that direction. We need to get these models right if we are ever going to achieve general artificial intelligence.

Generative models are built on the assumption that we can only interpret input data correctly if our model has learned the underlying statistical structure of this data. This is loosely analogous to our dreaming process, which points to the possibility that our brain has learned a model that is able to virtually recreate our environment.

In this chapter, we still have the mathematical structure of training function, loss function, and optimization presented throughout the book. However, unlike in the first few chapters, we aim to learn probability distributions, instead of deterministic functions. The overarching theme is that there is training data, and we want to come up with a mathematical model that generates new data similar to it.

There are two quantities of interest:

  • The true (and unknown) joint probability distribution of the features of the input data, p_data(x).

  • The model joint probability distribution of the features of the data along with the parameters of the model: p_model(x; θ).

Ideally, we want these two as close as possible. In practice, we settle for parameter values θ that allow p_model(x; θ) to work well for our particular use cases.

Throughout the chapter, we make use of three rules for probability distributions:

  1. The product rule that decomposes the multivariable joint probability distribution into a product of single variable conditional probability distributions.

  2. Bayes’ Rule, which allows us to flip between variables seamlessly.

  3. Independence or conditional independence assumptions on the features or on latent (hidden) variables, which allow us to simplify the product of single variable conditional probabilities even further.
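
The first two rules can be checked numerically on a tiny joint distribution of two binary variables (the probability table is made up for illustration):

```python
import numpy as np

# A tiny joint distribution p(x, y) over two binary variables (made-up numbers)
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])   # rows: x in {0, 1}; columns: y in {0, 1}

p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]   # conditional p(y|x)

# Product rule: p(x, y) = p(y|x) p(x)
reconstructed = p_y_given_x * p_x[:, None]

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y)
p_y = p_xy.sum(axis=0)
p_x_given_y = (p_y_given_x * p_x[:, None]) / p_y[None, :]
```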

In previous chapters we were minimizing the loss function. In this chapter the analogous function is the log likelihood function, and the optimization process always attempts to maximize this log likelihood (careful, we are not minimizing a loss function, we are maximizing an objective function instead). More on this soon.
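
As a small illustration of maximizing a log likelihood, here is a unit-variance Gaussian model with a single parameter μ, fit by a grid search (the true mean and the grid are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from p_data

def log_likelihood(mu, x):
    # Gaussian model with unit variance: sum of log p_model(x_i; mu)
    return np.sum(-0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi))

# Maximize (not minimize!) over a grid of candidate parameter values
grid = np.linspace(0.0, 4.0, 401)
best_mu = grid[np.argmax([log_likelihood(m, data) for m in grid])]
# best_mu lands near the sample mean, which is the closed-form maximizer
```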

Before we dive in, let’s make a note that puts our previous deterministic machine learning models into probability language. Our previous models learned a training function that mapped the features of the input data x to an output y (target or label), or f(x; θ) = y. When our goal was classification, f returned the label y that had the highest probability. That is, a classifier learns a direct map from input data x to class labels y; in other words, it models the posterior probability p(y|x) directly. We will elaborate on this later in the chapter.

What Are Generative Models Useful For?

Generative models have made it possible to blur the lines between true and computer-generated data. They have been improving and are achieving impressive successes: machine-generated images, including those of humans, are increasingly more realistic. It is hard to tell whether an image of a model in the fashion industry is that of a real person or the output of a generative machine learning model.

The goal of a generative model is to use a machine to generate novel data, such as audio waveforms containing speech, images, videos, or natural language text. Generative models sample data from a learned probability distribution, where the samples mimic reality as much as possible. The assumption here is that there is some unknown probability distribution underlying the real-life data that we want to mimic (otherwise our whole reality will be some random chaotic noise, lacking any coherence or structure), and the model’s goal is to learn an approximation of this probability distribution using the training data.

After collecting a large amount of data from a specific domain, we train a generative model to generate data similar to the collected data. The collected data can be millions of images or videos, thousands of audio recordings, or entire corpuses of natural language.

Generative models are useful for many applications, including augmenting data when data is scarce and more of it is needed, imputing missing values for higher-resolution images, and simulating new data for reinforcement learning or for semi-supervised learning when only a few labels are available. Another application is image-to-image translation, such as converting aerial images into maps or converting hand-drawn sketches to images. More applications include image denoising, inpainting, super-resolution, and image editing, such as making smiles wider, cheekbones higher, and faces slimmer.

Moreover, generative models are built to generate more than one acceptable output by drawing multiple samples from the desired probability distribution. This is different than our deterministic models that average over the output with different features during training using a mean squared error loss function or some other averaging loss function. The downside here is that a generative model can draw some bad samples as well.

One type of generative model, namely generative adversarial networks (invented in 2014 by Ian Goodfellow et al.), are incredibly promising and have a wide range of applications, from augmenting data sets to completing masked human faces to astrophysics and high energy physics, such as simulating data sets similar to those produced at the CERN Large Hadron Collider, or simulating distribution of dark matter and predicting gravitational lensing. Generative adversarial models set up two neural networks that compete against each other in a zero-sum game (think game theory in mathematics) until the machine itself cannot tell the difference between a real image and a computer-generated one. This is why their outputs seem very close to reality.

Chapter 7, which was heavily geared toward natural language processing, flirted with generative models without explicitly pointing them out. Most applications of natural language processing, which are not simple classification models (spam or not spam, positive sentiment or negative sentiment, and part of speech tagging), include language generation. Such examples are autocomplete on our smartphones or email, machine translation, text summarization, chatbots, and image captioning.

The Typical Mathematics of Generative Models

Generative models perceive and represent the world through probability distributions. That is, a color image is one sample from the joint probability distribution of pixels that together form a meaningful image (try to count the dimensions of such a joint probability distribution with all the red, green, and blue channels included), an audio wave is one sample from the joint probability distribution of audio signals that together make up meaningful sounds (these are also extremely high dimensional), and a sentence is one sample from the joint probability distribution of words or characters that together represent coherent sentences.

The glaring question is then: how do we compute these amazingly representative joint probability distributions that are able to capture the complexity of the world around us, but sadly happen to be extremely high dimensional?

The machine learning answer is predictable at this point. Start with an easy probability distribution that we know of, such as the Gaussian distribution, then find a way to mold it into another distribution that well approximates the empirical distribution of the data at hand. But how do we mold one distribution into another? We can apply a deterministic function to its probability density. So we must understand the following:

How do we apply a deterministic function to a probability distribution, and what is the probability distribution of the resulting random variable?

We use the following transformation formula:

$p_x(x) = p_z(g^{-1}(x)) \left| \det\left( \dfrac{\partial g^{-1}(x)}{\partial x} \right) \right|$

This is very well documented in many probability books and we will extract what we need from there shortly.
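As a quick numerical sanity check of the transformation formula (not an example from the text; the map $g(z) = e^z$ and the interval $[1, 2]$ are arbitrary choices), we can push a standard normal through $g$ and compare the empirical probability of landing in an interval against the probability computed from the transformed density:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Push a standard normal z through the deterministic map g(z) = exp(z).
z = rng.standard_normal(200_000)
x = np.exp(z)

# Change-of-variables density: p_x(x) = p_z(g^{-1}(x)) * |d/dx g^{-1}(x)|,
# with g^{-1}(x) = log(x), whose derivative is 1/x.
def p_x(t):
    return math.exp(-math.log(t) ** 2 / 2) / (math.sqrt(2 * math.pi) * t)

# Compare the formula against the empirical fraction of samples in [1, 2].
a, b = 1.0, 2.0
empirical = np.mean((x > a) & (x < b))

# Midpoint-rule integration of the transformed density over [a, b].
n = 10_000
dx = (b - a) / n
from_formula = sum(p_x(a + (i + 0.5) * dx) for i in range(n)) * dx
print(empirical, from_formula)  # the two estimates should agree to ~2 decimals
```

The empirical fraction and the integral of the formula's density match, confirming that the transformed random variable really is distributed according to the change-of-variables density.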

What is the correct function that we must apply?

One way is to train our model to learn it. We now know that neural networks have the capacity to represent a wide range of functions, so we can pass the simple probability distribution that we start with through a neural network (the neural network would be the formula of the deterministic function that we are looking for), then we learn the network’s parameters by minimizing the error between the empirical distribution of the given data and the distribution output by the network.

How do we measure errors between probability distributions?

Probability theory provides us with some measures of how two probability distributions diverge from each other, such as the Kullback-Leibler (KL) divergence. This is also related to cross-entropy from information theory.

Do all generative models work this way?

Yes and no. Yes in the sense that they are all trying to learn the joint probability distribution that presumably generated the training data. In other words, generative models attempt to learn the formula and the parameters of a joint probability distribution that maximizes the likelihood of the training data (or maximizes the probability that the model assigns to the training data). No in the sense that we only outlined an explicit way to approximate our desired joint probability distribution. This is one school of thought. In general, a model that defines an explicit and tractable probability density function allows us to operate directly on the log-likelihood of the training data, compute its gradient, and apply available optimization algorithms to search for the maximum. There are other models that provide an explicit but intractable probability density function, in which case we must use approximations to maximize the likelihood. How do we solve an optimization problem approximately? We can either use a deterministic approximation, relying on variational methods (variational autoencoder models), or use a stochastic approximation, relying on Markov chain Monte Carlo methods. Finally, there are implicit ways to approximate our desired joint probability distribution. Implicit models learn to sample from the unknown distribution without ever explicitly defining a formula for it. Generative adversarial networks fall into this category.

Nowadays, the three most popular approaches to generative modeling are:

Generative adversarial networks

These are implicit density models.

Variational models that provide an explicit but intractable probability density function

We approximate the solution of the optimization problem within the framework of probabilistic graphical models, where we maximize a lower bound on the log-likelihood of the data, since immediately maximizing the log-likelihood of the data is intractable.

Fully visible belief networks

These provide explicit and tractable probability density functions; examples include Pixel Convolutional Neural Networks (PixelCNN, 2016) and WaveNet (2016). These models learn the joint probability distribution by decomposing it into a product of one-dimensional probability distributions, one for each individual dimension conditioned on those that preceded it, and learning each of these distributions one at a time. This decomposition is thanks to the product rule, or chain rule, for probabilities. For example, PixelCNN trains a network that learns the conditional probability distribution of every individual pixel in an image given the previous pixels (to the left and to the top of it), and WaveNet trains a network that learns the conditional probability distribution of every individual audio signal in a sound wave conditioned on those that preceded it. The drawbacks here are that these models generate samples only one entry at a time and they disallow parallelization during generation. This slows down the generation process considerably. For example, it takes WaveNet two minutes of computation time to generate one second of audio, so we cannot use it for live back-and-forth conversations.

There are other generative models that fall into the above categories but are less popular, due to expensive computational requirements or difficulties in selecting the density function and/or its transformations. These include models that require a change of variables, such as nonlinear independent component estimation (explicit and tractable density model), Boltzmann machine models (explicit and intractable density model, with a stochastic Markov chain approximation to the solution of the maximization problem), and generative stochastic network models (implicit density model, again depending on a Markov chain to arrive at its approximate maximum likelihood). We survey these models briefly toward the end of this chapter. In practice, away from mathematical theory and analysis, Markov chain approaches have fallen out of favor due to their computational cost and their slowness to converge.

Shifting Our Brain from Deterministic Thinking to Probabilistic Thinking

In this chapter, we are slowly shifting our brains from deterministic thinking to probabilistic thinking. So far in this book, we have only used deterministic functions to make our predictions. The training functions were linear combinations of data features, sometimes composed with nonlinear activators, the loss functions were deterministic discriminators between the true values and the predicted ones, and the optimization methods were based on deterministic gradient descent methods. Stochasticity, or randomness, was only introduced when we needed to make the computations of the deterministic components of our model less expensive, such as stochastic gradient descent or stochastic singular value decomposition; when we split our data sets into training, validation, and test subsets; when we selected our minibatches; when we traversed some hyperparameter spaces; or when we passed the scores of data samples into the softmax function, which is a deterministic function, and interpreted the resulting values as probabilities. In all of these settings, stochasticity and the associated probability distributions related only to specific components of the model, serving only as a means to an end: enabling the practical implementation and computation of the deterministic model. They never constituted a model’s core makeup.

Generative models are different than the models that we have seen in previous chapters in the sense that they are probabilistic at their core. Nevertheless, we still have the training, loss, and optimization structure, except that now the model learns a probability distribution (explicitly or implicitly) as opposed to learning a deterministic function. Our loss function then measures the error between the true and the predicted probability distributions (at least for the explicit density models), so we must understand how to define and compute some sort of error function between probabilities instead of deterministic values. We must also learn how to optimize and take derivatives in this probabilistic setting.

In mathematics, it is a much easier problem to evaluate a given function (forward problem) than to find its inverse (inverse problem), let alone when we only have access to a few observations of the function values, such as our data samples. In our probabilistic setting, the forward problem looks like this: given a certain probability distribution, sample some data. The inverse problem is the one we care about: given this finite number of realizations (data samples) of a probability distribution that we do not know, find the probability distribution that most likely generated them. One difficulty that comes to mind is the issue of uniqueness: there could be more than one distribution that fits our data. Moreover, the inverse problem is usually much harder because in essence we have to act backward and undo the process that the forward function followed to arrive at the given observations. The issue is that most processes cannot be undone, and this is somehow bigger than us, embedded in the laws of nature: the universe tends to increase entropy. On top of the hardship inherent in solving inverse problems, the probability distributions that we usually try to estimate for AI applications are high dimensional, with many variables, and we are not even sure that our probabilistic model has accounted for all the variables (but that is problematic for deterministic models as well). These difficulties should not deter us. Representing and manipulating high-dimensional probability distributions is important for many math, science, finance, engineering, and other applications. We must dive into generative models.

Throughout the rest of this chapter, we will differentiate the case when our estimated probability distribution is given with an explicit formula, and when we do not have a formula but instead we numerically generate new data samples from an implicit distribution. Note that in the previous chapters, with all of our deterministic models, we always had explicit formulas for our training functions, including the ones given by decision trees, fully connected neural networks, and convolutional neural networks. Back then, once we estimated these deterministic functions from the data, we could answer questions like: what is the predicted value of the target variable? In probabilistic models, we answer a different question: what is the probability that the target variable assumes a certain value, or lies in a certain interval? The difference is that we do not know how our model combined the variables to produce our result, as in the deterministic case. What we try to estimate in probabilistic models is the probability that the model’s variables occur together with the target variable (their joint probability), ideally for all ranges of all variables. This will give us the probability distribution of the target variable, without having to explicitly formulate how the model’s variables interact to produce this result. This purely depends on observing the data.

Maximum Likelihood Estimation

Many generative models either directly or indirectly rely on the maximum likelihood principle. For probabilistic models, the goal is to learn a probability distribution that approximates the true probability distribution of the observed data. One way to do this is to specify an explicit probability distribution p model ( x ; θ ) with some unknown parameters θ , then solve for the parameters θ that make the training data set as likely to be observed as possible. That is, we need to find the θ that maximizes the likelihood of the training data, assigning a high probability for these samples. If there are m training data points, we assume that they are sampled independently, so that the probability of observing them together is just the product of the probabilities of all the individual samples. So we have:

$\theta_{optimal} = \arg\max_{\theta} \; p_{model}(x_1; \theta) \, p_{model}(x_2; \theta) \cdots p_{model}(x_m; \theta)$

Recall that each probability is a number between zero and one. If we multiply all of these probabilities together, we would obtain numbers extremely small in magnitude, which introduces numerical instabilities and runs the risk of underflow (when the machine stores a very small number as zero, essentially removing all significant digits). The log function solves this problem, transforming all numbers whose magnitude is extremely large or extremely small back to the reasonable magnitude realm. The good news is that the log transformation of our probabilities does not affect the value of the optimal $\theta$, since the log function is an increasing function. That is, if $f(\theta_{optimal}) \geq f(\theta)$ for all $\theta$, then $\log(f(\theta_{optimal})) \geq \log(f(\theta))$ for all $\theta$ as well. Composing with increasing functions does not change the inequality sign. The point is that the maximum likelihood solution becomes equivalent to the maximum log-likelihood solution. Now recall that the log function transforms products to sums, so we have:

$\theta_{optimal} = \arg\max_{\theta} \left[ \log p_{model}(x_1; \theta) + \log p_{model}(x_2; \theta) + \cdots + \log p_{model}(x_m; \theta) \right]$

Note that this expression wants to increase each of $p_{model}(x_i; \theta)$ for each data sample. That is, it prefers values of $\theta$ that push up the graph of $p_{model}(x; \theta)$ above each data point $x_i$. However, we cannot push up indefinitely. There must be a downward compensation, since the hyper-area of the region under the graph has to add up to 1, knowing that $p_{model}(x; \theta)$ is a probability distribution.
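Two of the claims above are easy to verify numerically: a product of many probabilities underflows while a sum of their logs does not, and maximizing the log-likelihood recovers the parameters that best explain the data. A sketch (the Gaussian-with-unknown-mean model and the grid search are illustrative choices, not the book's example):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

# Multiplying m small probabilities underflows quickly; summing logs does not.
probs = np.full(1000, 1e-4)
print(np.prod(probs))         # 0.0 -- underflow: (1e-4)^1000 is far below float range
print(np.sum(np.log(probs)))  # about -9210.34, perfectly representable

# Log-likelihood of the data under a Gaussian model with unknown mean theta
# (variance fixed at 1 for simplicity).
def log_likelihood(theta):
    return np.sum(-0.5 * (data - theta) ** 2 - 0.5 * math.log(2 * math.pi))

# Scan candidate thetas: the maximizer matches the sample mean,
# which is the closed-form maximum likelihood estimate for a Gaussian mean.
thetas = np.linspace(0, 6, 601)
best = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(best, data.mean())  # both near 3.0
```

The grid search stands in for the gradient-based optimization a real model would use; the point is that the log-likelihood surface peaks at the parameter value that best fits the data.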

We can reformulate our expression in terms of expectation and conditional probabilities:

$\theta_{optimal} = \arg\max_{\theta} \; \mathbb{E}_{x \sim \hat{p}_{data}} \log p_{model}(x \mid \theta)$

The deterministic models that we discussed in the previous chapters find the models’ parameters (or weights) by minimizing a loss function that measures the error between the models’ predictions and the true values provided by the data labels, or in other words, between y model and y data . In this chapter, we care about finding the parameters that maximize the log-likelihood of the data. It would be nice if there was a formulation of log-likelihood maximization that is analogous to minimizing a quantity that measures an error between the probability distributions p model and p data , so that the analogy between this chapter and the previous chapters is obvious. Luckily, there is. The maximum likelihood estimation is the same as minimizing the Kullback-Leibler (KL) divergence between the probability distribution that generated the data and the model’s probability distribution:

$\theta_{optimal} = \arg\min_{\theta} \; D_{KL}\left( p_{data}(x) \,\|\, p_{model}(x; \theta) \right)$

If $p_{data}$ happens to be a member of the family of distributions $p_{model}(x; \theta)$, and if we were able to perform the minimization precisely, then we would recover the exact distribution that generated the data, namely $p_{data}$. However, in practice, we do not have access to the data-generating distribution; in fact, it is the distribution that we are trying to approximate. We only have access to m samples from $p_{data}$. These samples define the empirical distribution $\hat{p}_{data}$ that places mass only on exactly these m samples. Now maximizing the log-likelihood of the training set is exactly equivalent to minimizing the KL divergence between $\hat{p}_{data}$ and $p_{model}(x; \theta)$:

$\theta_{optimal} = \arg\min_{\theta} \; D_{KL}\left( \hat{p}_{data}(x) \,\|\, p_{model}(x; \theta) \right)$

At this point we might be confused between three optimization problems that are in fact mathematically equivalent; they just happen to come from different subdisciplines and subcultures of mathematics, statistics, the natural sciences, and computer science:

  • Maximizing the log-likelihood of the training data

  • Minimizing the KL divergence between the empirical distribution of the training data and the model’s distribution

  • Minimizing the cross-entropy loss function between the training data labels and the model outputs, when we are classifying into multiple classes using composition with the softmax function.

Do not be confused. The parameters that minimize the KL divergence are the same as the parameters that minimize the cross-entropy and the negative log-likelihood.
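A quick numerical confirmation of this equivalence (the three-outcome distribution is an arbitrary toy choice): for a fixed empirical distribution p, KL divergence and cross-entropy differ only by the entropy of p, which does not depend on the model q, so they share the same minimizer.

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])  # a fixed empirical distribution

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

entropy_p = -np.sum(p * np.log(p))

# For any candidate model q: KL(p || q) = CrossEntropy(p, q) - H(p).
rng = np.random.default_rng(2)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0, 1.0])
    assert abs(kl(p, q) - (cross_entropy(p, q) - entropy_p)) < 1e-9

# Since H(p) is a constant with respect to q, minimizing KL and minimizing
# cross-entropy select the same q; the common minimizer is q = p itself.
print(kl(p, p))  # 0.0
```

This is exactly why minimizing the cross-entropy loss during classification is the same as maximizing the likelihood of the training labels.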

Explicit and Implicit Density Models

The goal of maximum log-likelihood estimation (or minimum KL divergence) is to find a probability distribution p model ( x ; θ ) that best explains the observed data. Generative models use this learned p model ( x ; θ ) to generate new data. There are two approaches here, one explicit and the other implicit:

Explicit density models

Define the formula for the probability distribution explicitly in terms of x and θ, then find the values of θ that maximize the log-likelihood of the training data samples by following the gradient vector (the partial derivatives with respect to the components of θ) uphill. One glaring difficulty here is coming up with a formula for the probability density that is able to capture the complexity in the data while at the same time staying amenable to computing the log-likelihood and its gradient.

Implicit density models

Sample directly from p model ( x ; θ ) without ever writing a formula for this distribution. Generative stochastic networks do this based on a Markov chain framework, which is slow to converge and thus unpopular for practical applications. Using this approach, the model stochastically transforms an existing sample to obtain another sample from the same distribution. Generative adversarial networks interact indirectly with the model’s probability distribution without explicitly defining it. They set up a zero-sum game between two networks, where one network generates a sample and the other network acts like a classifier determining whether the generated sample is from the correct distribution or not.

Explicit Density-Tractable: Fully Visible Belief Networks

These models admit an explicit probability density function with tractable log-likelihood optimization. They rely on the chain rule of probability to decompose the joint probability distribution p model ( x ) into a product of one-dimensional probability distributions:

$p_{model}(x) = \prod_{i=1}^{n} p_{model}(x_i \mid x_1, x_2, \ldots, x_{i-1})$

The main drawback here is that samples must be generated one component at a time (one pixel of an image, or one character of a word, or one entry of a discrete audio wave), therefore, the cost of generating one sample is O(n).
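The chain-rule decomposition itself is easy to verify on a toy joint distribution (three binary variables with random probabilities; a purely illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# A random joint distribution over three binary variables (x1, x2, x3),
# stored as a 2x2x2 table that sums to 1.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
p_x1 = joint.sum(axis=(1, 2))
p_x2_given_x1 = joint.sum(axis=2) / p_x1[:, None]
p_x3_given_x1x2 = joint / joint.sum(axis=2, keepdims=True)

# The product of the conditionals reconstructs the joint exactly.
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            product = (p_x1[x1]
                       * p_x2_given_x1[x1, x2]
                       * p_x3_given_x1x2[x1, x2, x3])
            assert abs(product - joint[x1, x2, x3]) < 1e-12
print("chain-rule factorization verified")
```

Models like PixelCNN and WaveNet exploit exactly this identity, except that each conditional is parameterized by a neural network rather than read off a table.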

Example: Generating Images via PixelCNN and Machine Audio via WaveNet

PixelCNN trains a convolutional neural network that models the conditional distribution of every individual pixel, given previous pixels (to the left and to the top of the target pixel). Figure 8-1 illustrates this.

WaveNet trains a convolutional neural network that models the conditional distribution of each entry of an audio wave, given the previous entries. We will only elaborate on WaveNet. It is the one-dimensional analog of PixelCNN and captures the essential ideas.

The goal of WaveNet is to generate wideband raw audio waveforms. So we must learn the joint probability distribution of an audio waveform x = ( x 1 , x 2 , , x T ) from a certain genre.

We use the product rule to decompose the joint distribution into a product of single variable distributions where we condition each entry of the audio waveform on those that preceded it:

$p_{model}(x) = \prod_{t=1}^{T} p_{model}(x_t \mid x_1, x_2, \ldots, x_{t-1})$

Figure 8-1. PixelCNN learns the conditional distribution of the nth pixel conditioned on the previous n−1 pixels (image source)

One difficulty is that audio waveforms have very high temporal resolution, with at least 16,000 entries per 1 second of audio (so one data sample that is a minute long is a vector with T = 960,000 entries). Each entry represents one time step of discretized raw audio, and is usually stored as a 16-bit integer. That is, each entry can assume any value between 0 and 65,535. If we keep this range, the network has to learn the probability for each entry, so the softmax function at the output level has to output 65,536 probability scores for every single entry. The total number of entries we have to do this for, along with the computational complexity of the network itself, become very expensive. To make this more tractable, we must quantize, which in electronics means approximate a continuously varying signal by one whose amplitude is restricted to a prescribed set of values. WaveNet transforms the raw data to restrict the entries’ values to 256 options each, ranging from 0 to 255, similar to the pixel range for digital images. Now, during training the network must learn the probability distribution of each entry over these 256 values, given the preceding entries, and during audio generation it samples from these learned distributions one entry at a time.
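The quantization step WaveNet uses is a μ-law companding transformation with μ = 255, which allocates more resolution to small amplitudes before reducing each entry to one of 256 integer values. A minimal sketch (audio assumed normalized to [-1, 1]):

```python
import numpy as np

MU = 255  # mu-law parameter; yields 256 quantization levels

def mu_law_encode(audio):
    """Compand audio in [-1, 1] and quantize to integer codes in [0, 255]."""
    companded = np.sign(audio) * np.log1p(MU * np.abs(audio)) / np.log1p(MU)
    # Map [-1, 1] to [0, 255] with rounding (truncation after adding 0.5).
    return ((companded + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(codes):
    """Map integer codes back to approximate amplitudes in [-1, 1]."""
    companded = 2 * (codes.astype(np.float64) / MU) - 1
    return np.sign(companded) * ((1 + MU) ** np.abs(companded) - 1) / MU

audio = np.linspace(-1, 1, 11)
codes = mu_law_encode(audio)
print(codes)                 # integers between 0 and 255
print(mu_law_decode(codes))  # close to the original amplitudes
```

After encoding, the network's softmax only has to output 256 probability scores per entry instead of 65,536, and decoding recovers the waveform with an error that is small relative to the signal.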

The last complication is that if the audio signal represents anything meaningful, then the vector representing it has long-range dependencies over multiple time scales. To capture these long-range dependencies, WaveNet uses dilated convolutions. These are one-dimensional kernels or filters that skip some entries to cover a wider range without increasing the number of parameters (see Figure 8-2 for an illustration).

Figure 8-2. A dilated convolution with kernel size equal to 2. At each layer the kernel has only two parameters, but it skips entries to achieve wider coverage (image source, with a nice animation).

Note also that the network cannot peek into the future, so the filters at each layer cannot use entries from the training sample that are ahead of the target entry. In one dimension we just stop filtering earlier at each convolutional layer, so it is a simple time shift. In two dimensions we use masked filters, which have zeros to the right and to the bottom of the central entry.
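The coverage gain from dilation is easy to quantify: with kernel size 2, each layer extends the receptive field by its dilation, so receptive fields grow exponentially with depth while the parameter count grows only linearly. A small sketch (the doubling schedule 1, 2, 4, …, 512 follows the pattern described in the WaveNet paper):

```python
# Receptive field of a stack of dilated causal convolutions.
# For kernel size k, a layer with dilation d extends the receptive
# field by (k - 1) * d entries.
kernel_size = 2

def receptive_field(dilations):
    return sum((kernel_size - 1) * d for d in dilations) + 1

# WaveNet-style doubling schedule: dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
print(receptive_field(dilations))  # 1024 samples with only 10 layers

# The same 10 layers with ordinary (undilated) convolutions:
print(receptive_field([1] * 10))   # only 11 samples
```

Ten dilated layers see over a thousand past samples, while ten ordinary layers of the same size would see only eleven; this is how the model captures long-range dependencies cheaply.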

WaveNet learns a total of T probability distributions, one for each entry of the audio waveform conditioned on the entries that preceded it: $p_{model}(x_1)$, $p_{model}(x_2 \mid x_1)$, $p_{model}(x_3 \mid x_1, x_2)$, …, and $p_{model}(x_T \mid x_1, x_2, \ldots, x_{T-1})$. During training, these distributions can be computed in parallel.

Now suppose we need to learn the probability distribution of the 100th entry, given the previous 99 entries. We input batches of audio samples from the training data, and the convolutional network uses only the first 99 entries of each sample, computing linear combinations (the filters linearly combine), passing through nonlinear activation functions from one layer to the next to the next using some skip connections and residual layers to battle vanishing gradients, and finally passing the result through a softmax function and outputting a vector of length 256 containing probability scores for the value of the 100th entry. This is the probability distribution for the 100th entry output by the model. After comparing this output distribution with the empirical distribution of the data for the 100th entry from the training batch, the parameters of the network get adjusted to decrease the error (lower the cross-entropy or increase the likelihood). As more batches of data and more epochs pass through the network, the probability distribution for the 100th entry, given the previous 99, will approach the empirical distribution from the training data. What we save within the network after training are the values of the parameters. Now we can use the trained network to generate machine audio, one entry at a time:

  1. Sample a value $x_1$ from the probability distribution $p_{model}(x_1)$.

  2. Augment $(x_1)$ with zeros to establish the required length for the network's input and pass the vector through the network. We will get as an output $p_{model}(x_2 \mid x_1)$, from which we can sample $x_2$.

  3. Augment $(x_1, x_2)$ with zeros and pass the vector through the network. We will get as an output $p_{model}(x_3 \mid x_1, x_2)$, from which we can sample $x_3$.

  4. Keep going.
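The generation loop above can be sketched with a stand-in for the trained network (the 4-value alphabet and the hand-made conditional table are hypothetical; a real model would run a forward pass and output 256 probabilities per entry):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a trained network: given the entries generated so far,
# return a probability distribution over the next entry's 4 possible
# values. (A real model would run a forward pass here.)
def next_entry_distribution(prefix):
    if not prefix:
        return np.array([0.25, 0.25, 0.25, 0.25])
    # Toy dependence on the previous entry only: each value tends to repeat.
    table = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
    return table[prefix[-1]]

# Ancestral sampling: draw x1, then x2 | x1, then x3 | x1, x2, and so on.
sample = []
for _ in range(8):
    p = next_entry_distribution(sample)
    sample.append(int(rng.choice(4, p=p)))
print(sample)  # one sequence, drawn entry by entry
```

The inherently sequential structure of this loop is precisely why generation is slow: each draw must wait for all the draws before it.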

We can condition WaveNet on a certain speaker identity, so we can generate different voices using one model.

The fact that we can train WaveNet in parallel but use it to generate audio only sequentially is a major shortcoming. This has been rectified with Parallel WaveNet, which is deployed online by Google Assistant, including serving multiple English and Japanese voices.

To summarize and place this discussion in the same mathematical context as this chapter, PixelCNN and WaveNet are models that aim to learn the joint probability distribution of image data or audio data from certain genres. They do so by decomposing the joint distribution into a product of one-dimensional probability distributions for each entry of their data, conditioned on all the preceding entries. To find these one-dimensional conditional distributions, they use a convolutional network to learn the way the observed entries interact together to produce a distribution of the next entry. This way, the input to the network is deterministic, and its output is a probability mass function. The network itself is also a deterministic function. We can view the network together with its output as a probability distribution with parameters that we tweak. As the training evolves, the output gets adjusted until it reaches an acceptable agreement with the empirical distribution of the training data. Therefore, we are not applying a deterministic function to a probability distribution and tweaking the function’s parameters until we agree with the distribution of the training data. We are instead starting with an explicit formula for a probability distribution with many parameters (the network’s parameters), then tweaking the parameters until this explicit probability distribution reasonably agrees with training data. We do this for each conditional probability distribution corresponding to each entry.

Explicit Density-Tractable: Change of Variables Nonlinear Independent Component Analysis

The main idea here is that we have a random variable x representing the observed training data, and we want to learn the source random variable s that generated it. We assume that there is a deterministic transformation g ( s ) = x , invertible and differentiable, that transforms the unknown s to the observed x . That is, s = g -1 ( x ) . Now we need to find an appropriate g in order to find the probability distribution of s . Moreover, we assume that s has independent entries, or components, so that its probability distribution is nothing but the product of the distributions of its components.

The formula that relates the probability distribution of a random variable with the probability distribution of a deterministic transformation of it is:

p s ( s ) = p x ( x ) × | det ( J ) | = p x ( g ( s ) ) × | det ( ∂ g ( s ) / ∂ s ) |

Here we are multiplying by the determinant of the Jacobian of the transformation, which accounts for the change in volume in space due to the transformation.
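
We can verify this formula numerically for a simple scalar transformation. The sketch below takes s ∼ N ( 0 , 1 ) and g ( s ) = exp ( s ) , so that x = g ( s ) is lognormal, and checks that the standard normal density of s equals the lognormal density of g ( s ) times the (scalar) Jacobian:

```python
import numpy as np

def p_s(s):
    # standard normal density: the source distribution
    return np.exp(-s**2 / 2) / np.sqrt(2 * np.pi)

def p_x(x):
    # lognormal density: the distribution of x = g(s) = exp(s)
    return np.exp(-np.log(x)**2 / 2) / (x * np.sqrt(2 * np.pi))

def g(s):
    return np.exp(s)

def dg(s):
    # Jacobian of g; a scalar derivative in this one-dimensional case
    return np.exp(s)

s = np.linspace(-2.0, 2.0, 9)
lhs = p_s(s)
rhs = p_x(g(s)) * np.abs(dg(s))   # p_x(g(s)) |det dg/ds|
print(np.max(np.abs(lhs - rhs)))  # ~0: the two sides agree
```
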

Nonlinear independent component estimation models the joint probability distribution as the nonlinear transformation of the data s = g -1 ( x ) . The transformation g is learned such that g -1 maps the data to a latent space where it conforms to a factorized distribution; that is, the mapping results in independent latent variables. The transformation g -1 is parameterized to allow for easy computation of the determinant of the Jacobian and the inverse Jacobian. g -1 is based on a deep neural network and its parameters are learned by optimizing the log-likelihood, which is tractable.
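
A minimal sketch of the kind of transformation these models use is an additive coupling layer, in the spirit of nonlinear independent component estimation: half of the input passes through unchanged, and the other half is shifted by a function of the first half (the function m below is an arbitrary stand-in, not the paper's architecture). The map is trivially invertible and its Jacobian determinant is exactly 1:

```python
import numpy as np

def m(h):
    # arbitrary nonlinear "network"; an illustrative stand-in
    return np.tanh(h) + h**2

def forward(x):
    # x -> s: split the input, shift the second half by m(first half)
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + m(x1)])

def inverse(s):
    # s -> x: the exact inverse, subtracting the same shift
    s1, s2 = np.split(s, 2)
    return np.concatenate([s1, s2 - m(s1)])

x = np.array([0.3, -1.2, 0.7, 2.0])
s = forward(x)
print(np.allclose(inverse(s), x))   # True: the layer is invertible

# Because x1 passes through unchanged and x2 only receives a shift that
# does not depend on x2, the Jacobian is triangular with unit diagonal,
# so |det J| = 1 and the log-likelihood stays tractable.
```
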

Note that the requirement that the transformation g must be invertible means that the latent variables s must have the same dimension as the data features (length of x ). This imposes restrictions on the choice of the function g and is a disadvantage of nonlinear independent component analysis models.

In comparison, generative adversarial networks impose very few requirements on g, and, in particular, allow s to have more dimensions than x .

Explicit Density-Intractable: Variational Autoencoder Approximation via Variational Methods

Deterministic autoencoders are composed of an encoder that maps the data from x space to latent z space of lower dimension, and a decoder that in turn maps the data from z space to x ^ space, with the objective of not losing much information, or reducing the reconstruction error, which means keeping x and x ^ close, for example, in the Euclidean distance sense. In this sense, we can view principal component analysis, which is based on the singular value decomposition X = U Σ V t , as a linear encoder, where the decoder is simply the transpose of the encoding matrix. Encoding and decoding functions can be nonlinear and/or neural networks.
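
This view of principal component analysis as a linear autoencoder is easy to check with NumPy: encoding with the top-k right singular vectors and decoding with their transpose leaves a reconstruction error equal to the discarded part of the spectrum:

```python
import numpy as np

# PCA as a linear autoencoder: encode with the top-k right singular vectors,
# decode with their transpose, and measure reconstruction error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
X = X - X.mean(axis=0)                                    # center the data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
encode = Vt[:k].T          # 5 x 2 encoder: z = X @ encode
decode = Vt[:k]            # 2 x 5 decoder: the transpose of the encoder

Z = X @ encode             # latent representation
X_hat = Z @ decode         # reconstruction
err = np.linalg.norm(X - X_hat) ** 2
print(err, np.sum(S[k:] ** 2))  # reconstruction error equals the discarded spectrum
```
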

For deterministic autoencoders, we cannot use the decoder as a data generator. At least, if we do, then we have to pick some z from latent z space and apply the decoder function to it. We are unlikely to get any x ^ that is close to how the desired data x looks, unless we picked a z that corresponds to a coded x due to overfitting. We need a regularization that provides us with some control over z space, giving us the benefit of avoiding overfitting and using autoencoders as a data generator. We accomplish this by shifting from deterministic autoencoding to probabilistic autoencoding.

Variational autoencoders are probabilistic autoencoders: the encoder outputs probability distributions over the latent space z instead of single points. Moreover, during training, the loss function includes an extra regularization term that controls the distribution over the latent space. Therefore, the loss function for variational autoencoders contains a reconstruction term (such as mean squared distance) and a regularization term to control the probability distribution output by the encoder. The regularization term can be a KL divergence from a Gaussian distribution, since the underlying assumption is that simple probabilistic models best describe the training data. In other words, complex relationships can be probabilistically simple. We have to be careful here, since this introduces a bias: the simple assumption on the data distribution in the latent variable can be a drawback if it is too weak. That is, when the assumption on the prior distribution or the assumption on the approximate posterior distribution is too weak, even with a perfect optimization algorithm and infinite training data, the gap between the estimate and the true log-likelihood can lead to p model learning a completely different distribution than the true p data .
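
A sketch of such a loss for a single data point, using the closed-form KL divergence from a diagonal Gaussian q ( z | x ) = N ( μ , diag ( σ 2 ) ) to the standard normal prior (the function and the sample values below are illustrative, not a full training loop):

```python
import numpy as np

# Sketch of the variational autoencoder loss for one data point:
# a reconstruction term plus the KL divergence from the encoder's
# diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)) to the prior N(0, I).
def vae_loss(x, x_hat, mu, log_var):
    reconstruction = np.sum((x - x_hat) ** 2)   # mean-squared-type term
    # Closed form: KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return reconstruction + kl

# Hypothetical encoder/decoder outputs, for illustration only.
x = np.array([1.0, 0.5])
x_hat = np.array([0.9, 0.6])
mu = np.array([0.1, -0.2])
log_var = np.array([-0.1, 0.3])

print(vae_loss(x, x_hat, mu, log_var))
# The KL term vanishes exactly when mu = 0 and log_var = 0, i.e., when
# q(z|x) already equals the prior: that is the regularizer at work.
print(vae_loss(x, x_hat, np.zeros(2), np.zeros(2)))  # reconstruction only
```
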

Mathematically, we maximize a lower bound on the log-likelihood of the data. In science, variational methods define lower bounds on an energy functional that we want to maximize, or upper bounds on an energy functional that we want to minimize. These bounds are usually easier to obtain and have tractable optimization algorithms, even when the log-likelihood does not. At the same time, they provide good estimates for the optimal values that we are searching for:

ℒ ( x , θ ) ≤ log p model ( x ; θ )

Variational methods often achieve very good likelihood, but subjective evaluations regard their generated samples as being of lower quality. They are also considered more difficult to optimize than fully visible belief networks. Moreover, people find their mathematics more difficult than that of fully visible belief networks and of generative adversarial networks (discussed soon).

Explicit Density-Intractable: Boltzmann Machine Approximation via Markov Chain

Boltzmann machines (originating in the 1980s) are a family of generative models that rely on Markov chains to train generative models. This is a sampling technique that happens to be more expensive than the simple sampling of a mini-batch from a data set to estimate a loss function. We will discuss Markov chains in the context of reinforcement learning in Chapter 11. In the context of data generation, they have many disadvantages that caused them to fall out of favor: high computational cost, impracticality and inefficiency when extended to higher dimensions, slow convergence, and no clear way to know whether the model has converged or not, even when the theory says it must converge. Markov chain methods have not scaled to problems like ImageNet generation.

A Markov chain has a transition operator q that encodes the probability of transitioning from one state of the system to another. This transition operator q needs to be explicitly defined. We can generate data samples by repeatedly drawing a sample x ' q ( x ' | x ) , updating x ' sequentially according to the transition operator q. This sequential nature of generation is another disadvantage compared to single step generation. Markov chain methods can sometimes guarantee that x’ will eventually converge to a sample from p model ( x ) , even though the convergence might be slow.
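
A small numeric sketch of this kind of generation, with an explicitly defined transition operator q on three states: samples are drawn one sequential update at a time, and only after many steps do they settle into the chain's stationary distribution:

```python
import numpy as np

# An explicit transition operator q on a toy 3-state chain:
# q[i, j] = P(next state = j | current state = i).
q = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])

rng = np.random.default_rng(0)
state, counts, burn_in, steps = 0, np.zeros(3), 1000, 50_000
for t in range(burn_in + steps):
    state = rng.choice(3, p=q[state])   # one sequential update x' ~ q(x' | x)
    if t >= burn_in:
        counts[state] += 1

empirical = counts / steps

# Stationary distribution: the left eigenvector of q with eigenvalue 1.
vals, vecs = np.linalg.eig(q.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()
print(empirical, pi)   # close, but only after many sequential draws
```
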

Some models, such as deep Boltzmann machines, employ both Markov chain and variational approximations.

Implicit Density-Markov Chain: Generative Stochastic Network

Generative stochastic networks (Bengio et al. 2014) do not explicitly define a density function, and instead use a Markov chain transition operator that interacts indirectly with p model ( x ) by sampling from the training data. This Markov chain operator must be run several times to obtain a sample from p model ( x ) . These methods still suffer from the shortcomings of Markov chain methods mentioned in the previous section.

Implicit Density-Direct: Generative Adversarial Networks

Currently the most popular generative models are:

  • Fully visible deep belief networks, such as PixelCNN, WaveNet, and their variations.

  • Variational autoencoders, consisting of a probabilistic encoder-decoder architecture.

  • Generative adversarial networks, which have received a lot of attention from the scientific community due to the simplicity of their concept and the good quality of their generated samples. We discuss them now.

Generative adversarial networks were introduced in 2014 by Ian Goodfellow et al. The mathematics involved is a beautiful mixture between probability and game theory. Generative adversarial networks avoid some disadvantages associated with other generative models:

  • Generating samples all at once, in parallel, as opposed to feeding each new pixel back into the network to predict the next one, as in PixelCNN.

  • The generator function has few restrictions. This is an advantage relative to Boltzmann machines, for which few probability distributions admit tractable Markov chain sampling, and relative to nonlinear independent component analysis, for which the generator must be invertible and the latent variables z must have the same dimension as the samples x.

  • Generative adversarial networks do not need Markov chains. This is an advantage relative to Boltzmann machines and to generative stochastic networks.

  • While variational autoencoders might never converge to the true data generating distribution if they assume prior or posterior distributions that are too weak, generative adversarial networks converge to the true p data , given that we have infinite training data and a large enough model. Moreover, generative adversarial networks do not need variational bounds, and the specific model families used within the generative adversarial network framework are already known to be universal approximators. Thus, generative adversarial networks are already known to be asymptotically consistent. On the other hand, some variational autoencoders are conjectured to be asymptotically consistent, but this still needs to be proven.

The disadvantage of generative adversarial networks is that training them requires spotting the Nash equilibrium of a game, which is more difficult than just optimizing an objective function. Moreover, the solution tends to be numerically unstable. This was improved in 2015 by Alec Radford et al. in their paper “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”. This approach led to more stable models.

During training, generative adversarial networks formulate a game between two separate networks: a generator network and a discriminator network that tries to classify generator samples as either coming from the true distribution p data ( x ) or from the model p model ( x ) . The loss functions of the two networks are related, so that the discriminator communicates the discrepancy between the two distributions, and the generator adjusts its parameters accordingly until it exactly reproduces the true data distribution (in theory) so that the discriminator’s classifications are no better than random guesses.

The generator network wants to maximize the probability that the discriminator assigns the wrong label in its classification, whether the sample is from the training data or from the model, while the discriminator network wants to minimize that probability. This is a two-player zero-sum game, where one player’s gain is another’s loss. We end up solving a minimax problem instead of a purely maximizing or minimizing problem. A unique solution exists.

How Do Generative Adversarial Networks Work?

Keeping in mind the goal of learning the generator’s probability distribution p g ( x ; θ ) over the data, here’s how the learning progresses for generative adversarial networks:

  1. Start with a random sample z from a prior probability distribution p z ( z ) , which could be just uniform random noise for each component of z .

  2. Start also with a random sample x from the training data, so it is a sample from the probability distribution p data ( x ) that the generator is trying to learn.

  3. Apply to z the deterministic function G ( z , θ g ) representing the generative neural network. The parameters θ g are the ones we need to tweak via backpropagation until the output G ( z , θ g ) looks similar to samples from the training data set.

  4. Pass the output G ( z , θ g ) into another deterministic function D representing the discriminative neural network. Now we have the new output D ( G ( z , θ g ) , θ d ) , which is just a number between zero and one, close to one when D believes the sample came from the training data and close to zero when it believes the sample came from the generator. Thus, for this input from the generator, D ( G ( z , θ g ) , θ d ) must return a number close to zero. The parameters θ d are the ones we need to tweak via backpropagation until D returns the wrong classification around half of the time.

  5. Pass also the sample x from the training data to D, so we evaluate D ( x , θ d ) . For this input, D ( x , θ d ) must return a number close to one.

  6. What is the loss function for these two networks, which has in its formula both sets of parameters θ g and θ d , along with the sampled vectors x and z ? The discriminator function D wants to get it right for both types of inputs, x and G ( z , θ g ) . So its parameters θ d must be selected so that D ( x , θ d ) returns a number close to 1 when the input is x , and D ( G ( z , θ g ) , θ d ) returns a number close to 0 when the input comes from the generator. We can score this with the log function, since log is near 0 when its argument is near 1 and very negative when its argument is near 0. Therefore, D needs the parameters θ d that maximize:

    𝔼 x ∼ p data ( x ) [ log D ( x , θ d ) ] + 𝔼 z ∼ p z ( z ) [ log ( 1 - D ( G ( z , θ g ) , θ d ) ) ]

    At the same time, G needs the parameters θ g that minimize log ( 1 - D ( G ( z , θ g ) , θ d ) ) . Combined, D and G engage in a two-player minimax game with value function V (D, G):

    min G max D V ( D , G ) = 𝔼 x ∼ p data ( x ) [ log D ( x ) ] + 𝔼 z ∼ p z ( z ) [ log ( 1 - D ( G ( z ) ) ) ]
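
The value function can be estimated by Monte Carlo from samples. In the sketch below, the "networks" are hypothetical stand-ins chosen at the game's theoretical equilibrium, where the generator matches p data and the best discriminator outputs 1/2 everywhere; the value then settles at - log 4 , the known optimal value of the game:

```python
import numpy as np

# Monte Carlo estimate of the GAN value function
#   V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]
# with hypothetical stand-ins for the networks, evaluated at equilibrium.
rng = np.random.default_rng(0)

def G(z):
    # "generator": here it happens to reproduce p_data exactly,
    # since p_data and the prior p_z are both N(0, 1) in this sketch
    return z

def D(x):
    # the best possible discriminator when p_g = p_data: D = 1/2 everywhere
    return np.full_like(x, 0.5)

x = rng.normal(size=10_000)    # samples from p_data
z = rng.normal(size=10_000)    # samples from the prior p_z

V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V, -np.log(4.0))   # both approximately -1.386
```
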

This is a very simple mathematical structure, where setting up a discriminator network allows us to get closer to the true data distribution without ever explicitly defining it or assuming anything about it.

Finally, we note that generative adversarial networks are highly promising for many applications. One example is the dramatic enhancement they bring to semi-supervised learning, where the “NIPS 2016 Tutorial: Generative Adversarial Networks” (Goodfellow 2016) reports:

We introduce an approach for semi-supervised learning with generative adversarial networks that involves the discriminator producing an additional output indicating the label of the input. This approach allows us to obtain state of the art results on MNIST, SVHN, and CIFAR-10 in settings with very few labeled examples. On MNIST, for example, we achieve 99.14% accuracy with only 10 labeled examples per class with a fully connected neural network—a result that’s very close to the best known results with fully supervised approaches using all 60,000 labeled examples. This is very promising because labeled examples can be quite expensive to obtain in practice.

Another far-reaching application of generative adversarial networks (and machine learning in general) is simulating data for high energy physics. We discuss this next.

Example: Machine Learning and Generative Networks for High Energy Physics

The following discussion is inspired by and borrows from the Machine Learning for Jet Physics Workshop 2020 and the two articles “Deep Learning and Its Application to LHC Physics” (Guest et al. 2018) and “Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics” (Kansal et al. 2021).

Before the deep learning revolution began in 2012, the field of high energy physics traditionally relied, in its analyses and computations, on physical considerations and human intuition, boosted decision trees, handcrafted data feature engineering and dimensionality reduction, and traditional statistical analysis. These techniques, while insightful, are naturally far from optimal and hard to automate or extend to higher dimensions. Several studies have demonstrated that traditional shallow networks based on physics-inspired engineered high-level features are outperformed by deep networks based on higher-dimensional, lower-level features that receive less preprocessing. Many areas of Large Hadron Collider data analysis have suffered from long-standing suboptimal feature engineering, and deserve reexamination. The high energy physics field is thus a breeding ground ripe for machine learning applications. A lot of progress is taking place on this front. The field is employing several machine learning techniques, including artificial neural networks, kernel density estimation, support vector machines, genetic algorithms, boosted decision trees, random forests, and generative networks.

The experimental program of the Large Hadron Collider probes the most fundamental questions in modern physics: the nature of mass, the dimensionality of space, the unification of the fundamental forces, the particle nature of dark matter, and the fine-tuning of the Standard Model. One driving goal is to understand the most fundamental structure of matter. Part of that entails searching for and studying exotic particles, such as the top quark and Higgs boson, produced in collisions at accelerators such as the Large Hadron Collider. Specific benchmarks and challenges include mass reconstruction, jet substructure, and jet-flavor classification. For example, one can identify jets from heavy (c, b, t) or light (u, d, s) quarks, gluons, and W, Z, and H bosons.

Running high energy particle experiments and collecting the resulting data is extremely expensive. The data collected is enormous in terms of the number of collisions and the complexity of each collision. In addition, the bulk of accelerator events does not produce interesting particles (signal particles versus background particles). Signal particles are rare, so high data rates are necessary. For example, the Large Hadron Collider detectors have O(10^8) sensors used to record the large number of particles produced after each collision. It is thus of paramount importance to extract maximal information from experimental data (think regression and classification models), to accurately select and identify events for effective measurements, and to produce reliable methods for simulating new data similar to data produced by experiments (think generative models). High energy physics data is characterized by its high dimensionality, along with the complex topologies of many signal events.

This discussion ties into our chapter through the nature of collisions and the interaction of their products with Large Hadron Collider detectors. They are quantum mechanical, and therefore the observations resulting from a particular interaction are fundamentally probabilistic. The resulting data analysis must then be framed in statistical and probabilistic terms.

In our chapter, our aim is to learn the probability distribution p ( θ | x ) of the model’s parameters given the observed data. If the data is fairly low dimensional, such as fewer than five dimensions, the problem of estimating the unknown statistical model from the simulated samples would not be difficult, using histograms or kernel-based density estimates. However, we cannot easily extend these simple methods to higher dimensions, due to the curse of dimensionality. In a single dimension, we would need N samples to estimate the source probability density function, but in d dimensions, we would need O ( N^d ) samples. The consequence is that if the dimension of the data is greater than 10 or so, it is impractical or even impossible to use naive methods to estimate the probability distribution, as doing so would require a prohibitive amount of computational resources.
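
The growth is easy to make concrete:

```python
# If N samples per dimension suffice for a one-dimensional density estimate,
# a naive d-dimensional estimate needs on the order of N**d samples.
N = 100          # an illustrative choice of samples per dimension
for d in (1, 2, 5, 10):
    print(f"d = {d:2d}: ~{N**d:.3e} samples")
# Already at d = 10 the requirement is 10**20 samples: computationally
# prohibitive, which is why naive estimation breaks down in high dimensions.
```
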

High energy physicists have traditionally dealt with the curse of dimensionality by reducing the dimension of the data through a series of steps that operate both on individual collision events and on collections of events. These established approaches reduced the data, via specific, hand-engineered features, to a number of dimensions small enough to allow the estimation of the unknown probability distribution p ( x | θ ) using samples generated by simulation tools. Obviously, due to the complexity of the data and the rarity and subtle signatures of potential new physics, this traditional approach is probably suboptimal. Machine learning eliminates the need for hand-engineered features and manual dimensionality reduction, which can miss crucial information in the lower-level, higher-dimensional data. Moreover, the structure of lower-level data obtained directly from the sensors fits very well with well-established neural network models, such as convolutional neural networks and graph neural networks; for example, the projective tower structure of calorimeters present in nearly all modern high energy physics detectors is similar to the pixels of an image.

Note, however, that while the image-based approach has been successful, the actual detector geometry is not perfectly regular, thus some data preprocessing is required to represent jet images. In addition, jet images are typically very sparse. Both irregular geometry and sparsity can be addressed using graph-based convolutional networks instead of the usual convolutional networks for our particle data modeling. Graph convolutional networks extend the application of convolutional neural networks to irregularly sampled data. They are able to handle sparse, permutation invariant data with complex geometries. We will discuss graph networks in Chapter 9. They always come with nodes, edges, and a matrix encoding the relationships in the graph, called an adjacency matrix. In the context of high energy physics, the particles of a jet represent the nodes of the graph, and the edges encode how close the particles are in a learned adjacency matrix. In high energy physics, graph-based networks have been successfully applied to classification, reconstruction, and generation tasks.
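
A single graph convolution layer can be sketched in a few lines: with an adjacency matrix A over a toy four-particle "jet", add self-loops, row-normalize, and mix each node's features with its neighborhood's (the feature names and sizes below are illustrative):

```python
import numpy as np

# Minimal sketch of one graph convolution layer, H' = relu(A_hat @ H @ W),
# on a toy 4-node graph. A_hat is the adjacency matrix with self-loops,
# row-normalized so each node averages its neighborhood's features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # which particles are "close"

A_hat = A + np.eye(4)                        # add self-loops
A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))   # 3 input features per node (illustrative)
W = rng.normal(size=(3, 2))   # learned weights mapping 3 -> 2 features

H_next = np.maximum(0.0, A_hat @ H @ W)   # ReLU(A_hat H W)
print(H_next.shape)   # (4, 2): same nodes, new per-node features
```
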

The subject of our chapter is generative models, or generating data similar to a given data set. Generating or simulating data faithful to the experimental data collected in high energy physics is of great importance. In “Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics”, the authors develop graph-based generative models, using a generative adversarial network framework, for simulating sparse data sets like those produced at the CERN Large Hadron Collider.

The authors illustrate their approach by training on and generating sparse representations of MNIST handwritten digit images and jets of particles in proton-proton collisions like those at the Large Hadron Collider. The model successfully generates sparse MNIST digits and particle jet data. The authors use two metrics to quantify agreement between real and generated data: a graph-based Fréchet inception distance and the particle and jet feature-level 1-Wasserstein distance.
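
For one-dimensional features with equally sized samples, the 1-Wasserstein distance reduces to the mean absolute difference between sorted values, which makes the metric easy to sketch (the Gaussian "jet features" below are illustrative stand-ins, not the paper's data):

```python
import numpy as np

# Feature-level 1-Wasserstein distance between real and generated samples.
# For equally sized one-dimensional samples, it is the mean absolute
# difference between the sorted real and sorted generated values.
def wasserstein_1d(real, generated):
    return np.mean(np.abs(np.sort(real) - np.sort(generated)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5_000)   # stand-in for a real feature
good = rng.normal(0.0, 1.0, size=5_000)   # well-matched generator
bad = rng.normal(1.0, 2.0, size=5_000)    # mismatched generator

print(wasserstein_1d(real, good))   # small: distributions agree
print(wasserstein_1d(real, bad))    # noticeably larger: distributions differ
```
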

Other Generative Models

We have discussed state-of-the-art generative models (as of 2022), but this chapter would be incomplete if we did not go over Naive Bayes, Gaussian mixture, and Boltzmann machine models. There are many others. That being said, Yann LeCun (VP and chief AI scientist at Meta) offers his perspective on some of these models:

Researchers in speech recognition, computer vision, and natural language processing in the 2000s were obsessed with accurate representations of uncertainty. This led to a flurry of work on probabilistic generative models such as Hidden Markov Models in speech, Markov random fields and constellation models in vision, and probabilistic topic models in NLP, e.g., with latent Dirichlet analysis. There were debates at computer vision workshops about generative models vs discriminative models. There were heroic-yet-futile attempts to build object recognition systems with non-parametric Bayesian methods. Much of this was riding on previous work on Bayesian networks, factor graphs and other graphical models. That’s how one learned about exponential family, belief propagation, loopy belief propagation, variational inference, etc. Chinese restaurant process, Indian buffet process, etc. But almost none of this work was concerned with the problem of learning representations. Features were assumed to be given. The structure of the graphical model, with its latent variables, was assumed to be given. All one had to do was to compute some sort of log-likelihood by linearly combining features, and then use one of the above mentioned sophisticated inference methods to produce marginal distributions over the unknown variables, one of which being the answer, e.g., a category. In fact, exponential family pretty much means shallow: the log-likelihood can be expressed as a linearly parameterized function of features (or simple combinations thereof). Learning the parameters of the model was seen as just another variational inference problem. It’s interesting to observe that almost none of this is relevant to today’s top speech, vision, and NLP systems. As it turned out, solving the problem of learning hierarchical representations and complex functional dependencies was a much more important issue than being able to perform accurate probabilistic inference with shallow models. 
This is not to say that accurate probabilistic inference is not useful.

In the same vein, he continues:

Generative Adversarial Networks are nice for producing pretty pictures (though they are being edged out by diffusion models, or “multistep denoising auto-encoders” as I like to call them), but for recognition and representation learning, GANs have been a big disappointment.

Nevertheless, there is a lot of math to be learned from all these models. In my experience, we understand and retain math at a much deeper level when we see it developed and utilized for specific purposes, as opposed to only training the neurons of the brain. Many mathematicians claim to experience pleasure while proving theorems that have yet to find applications. I was never one of those.

Naive Bayes Classification Model

The Naive Bayes model is a very simple classification model that we can also use as a generative model, since it ends up computing a joint probability distribution p ( x , y k ) for the data to determine its classification. The training data has features x and labels y k . Therefore, we can use the Naive Bayes model to generate new data points, together with labels, by sampling from this joint probability distribution.

The goal of a Naive Bayes model is to compute the probability of the class y k given the data features x , which is the conditional probability p ( y k | x ) . For data with many features (high-dimensional x ), this is expensive to compute, so we use Bayes’ Rule and exploit the reverse conditional probability, which in turn leads to the joint probability distribution. That is:

p ( y k | x ) = p ( y k ) p ( x | y k ) / p ( x ) = p ( x , y k ) / p ( x )

The Naive Bayes model makes the very strong and naive assumption, which in practice works better than one might expect, that the data features are mutually independent when conditioned on the class label y k . This assumption helps simplify the joint probability distribution in the numerator tremendously, especially when we expand it as a product of single-variable conditional probabilities. The feature independence assumption, conditional on the class label y k , means:

p ( x i | x i+1 , x i+2 , … , x n , y k ) = p ( x i | y k )

Thus, the joint probability distribution factors into:

p ( x , y k ) = p ( x 1 | x 2 , … , x n , y k ) p ( x 2 | x 3 , … , x n , y k ) ⋯ p ( x n | y k ) p ( y k ) = p ( x 1 | y k ) p ( x 2 | y k ) ⋯ p ( x n | y k ) p ( y k )

We can now estimate these single feature probabilities conditioned on each category of the data easily from the training data. We can similarly estimate the probability of each class p ( y k ) from the training data, or we can assume the classes are equally likely, so that p ( y k ) = 1 / (number of classes).

Note that in general, generative models find the joint probability distribution p ( x , y k ) between labels y k and data x . Classification models, on the other hand, calculate the conditional probabilities p ( y k | x ) . They focus on calculating the decision boundaries between different classes in the data by returning the class y k with the highest probability. So for the Naive Bayes classifier, it returns the label y * with the highest value for p ( y k | x ) , which is the same as the highest value for p ( x 1 | y k ) p ( x 2 | y k ) p ( x n | y k ) p ( y k ) .
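As a concrete sketch of these computations, here is a minimal Bernoulli Naive Bayes classifier in NumPy. The toy data and the Laplace smoothing choice are illustrative assumptions, not part of the text:

```python
import numpy as np

# Toy binary feature matrix (4 samples, 3 features) and class labels
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
# p(y_k): class priors estimated from the training data
priors = np.array([(y == k).mean() for k in classes])
# p(x_i = 1 | y_k) with Laplace smoothing to avoid zero probabilities
cond = np.array([(X[y == k].sum(axis=0) + 1) / ((y == k).sum() + 2)
                 for k in classes])

def predict(x):
    # Joint probability p(x, y_k) = p(x_1|y_k) ... p(x_n|y_k) p(y_k);
    # return the class with the highest joint probability
    likelihood = np.prod(np.where(x == 1, cond, 1 - cond), axis=1)
    joint = likelihood * priors
    return classes[np.argmax(joint)]

print(predict(np.array([1, 0, 1])))  # resembles the class 0 samples
```

Note that the product of conditional probabilities underflows quickly for many features; real implementations sum log-probabilities instead.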

Gaussian Mixture Model

In a Gaussian mixture model, we assume that all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters (means and covariance matrices). We can think of mixture models as being similar to k-means clustering, but here we include information about the centers of the clusters (means of our Gaussians) along with the shape of the spread of the data in each cluster (determined by the covariance of the Gaussians). To determine the number of clusters in the data, Gaussian mixture models sometimes implement the Bayesian information criterion. We can also restrict our model to control the covariance of the different Gaussians in the mixture: full, tied, diagonal, tied diagonal, and spherical (see Figure 8-3 for an illustration).

Figure 8-3. Gaussian mixture covariance types

We finally need to maximize the likelihood of the data to estimate the unknown parameters of the mixture (the means and the entries of the covariance matrices).

Maximum likelihood becomes intractable when there are latent or hidden variables in the data (variables that are not directly measured or observed). The way around this is to use an expectation maximization (EM) algorithm to estimate the maximum likelihood. The expectation maximization algorithm works as follows:

  1. Estimate the values of the latent variables by creating a function for the expectation of the log-likelihood, using the current estimate of the unknown parameters.

  2. Optimize: compute new parameters that maximize the expected log-likelihood evaluated in step 1.

  3. Repeat steps 1 and 2 until convergence.
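These two alternating steps can be sketched for a one-dimensional mixture of two Gaussians. The toy data, the initial guesses, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated one-dimensional Gaussian clusters
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

# Initial guesses for the unknown parameters (means, variances, weights)
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities = expected latent cluster assignments
    resp = pi * gaussian(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters to maximize the expected log-likelihood
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

print(np.sort(mu))  # estimated means end up near the true means -4 and 4
```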

We can see how Gaussian mixture models can be used as clustering, generative, or classification models. For clustering, this is the main part of the model buildup. For generation, sample new data points from the mixture after computing the unknown parameters via expectation maximization. For classification, given a new data point, the model assigns it to the Gaussian to which it most probably belongs.
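For instance, scikit-learn's GaussianMixture class (assuming that library is available) exposes all three uses; the toy two-blob data set is made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D data drawn from two well-separated blobs
data = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Clustering: fit a two-component mixture (EM runs under the hood)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

# Generation: sample new points from the fitted joint distribution
samples, labels = gmm.sample(10)

# Classification: assign a new point to its most probable Gaussian
print(gmm.predict([[5.1, 4.8]]))
```

The `covariance_type` argument selects among the restrictions illustrated in Figure 8-3 (full, tied, diagonal, spherical).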

The Evolution of Generative Models

In this section, we tell the story that helped end the winter of neural networks, and ultimately led to modern probabilistic deep learning models, such as variational autoencoders, fully visible deep belief networks, and generative adversarial networks. We encounter the progression from Hopfield nets to Boltzmann machines to restricted Boltzmann machines. I have a special affinity for these models: in addition to their historical value, and their learning of the joint probability distribution of the data features by assembling a network of basic computational units, they employ the mathematical machinery of the extremely neat and well-developed field of statistical mechanics, my initial area of research.

In statistical mechanics, we define probability distributions in terms of energy functions. The probability of us finding a system in a certain state x depends on its energy E ( x ) at that state. More precisely, high energy states are less probable, which manifests itself in the negative sign in the exponential in the following formula:

p ( x ) = exp ( - E ( x ) ) / Z

The exponential function guarantees that p is positive, and the partition function Z in the denominator ensures that the sum (or integral if x is continuous) of p ( x ) over all states x is 1, making p a valid probability distribution. Machine learning models that define joint probability distributions this way are called energy-based models, for obvious reasons. They differ in how they assign the energy at each state, meaning in the specific formula they use for E ( x ) , which in turn affects the formula for the partition function Z. The formula for E ( x ) contains the parameters of the model θ , which we need to compute from the data using maximum likelihood estimation. In fact, it is better if we make the dependence of p, E, and Z on θ explicit in the joint probability distribution formula:

p ( x , θ ) = exp ( - E ( x , θ ) ) / Z ( θ )

In most cases, it is not possible to compute a closed formula for the partition function Z, rendering the maximum likelihood estimation intractable. More precisely, when we maximize the log-likelihood, we need to compute its gradient with respect to the parameters θ , which in turn forces us to compute the gradient of the log of the partition function Z with respect to θ . The following quantity appears frequently in these computations:

∇ θ log Z ( θ ) = 𝔼 x ∼ p ( x , θ ) ∇ θ log numerator ( x , θ )

where in our case the numerator in the formula of the energy-based joint probability distribution is exp ( - E ( x , θ ) ) , but this can also differ among models.
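A tiny numerical sketch of an energy-based distribution over a handful of discrete states; the energy values are made up for illustration:

```python
import numpy as np

# Illustrative energy values E(x) for four discrete states (made up)
energies = np.array([0.0, 1.0, 2.0, 3.0])

# The partition function Z sums exp(-E(x)) over all states,
# turning the unnormalized numerator into a valid distribution
Z = np.exp(-energies).sum()
p = np.exp(-energies) / Z

print(p)  # probabilities decrease as the energy increases
```

With only four states Z is trivial to compute; the intractability discussed next arises when the state space is exponentially large (e.g., all binary vectors of length n) and this sum cannot be enumerated.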

The cases where the partition function is intractable urge us to resort to approximation methods such as stochastic maximum likelihood and contrastive divergence. Other methods sidestep approximating the partition function and compute conditional probabilities without knowledge of the partition function. They take advantage of the ratio definition of conditional probabilities, along with the ratio in the definition of an energy-based joint probability distribution, effectively canceling out the partition function. These methods include score matching, ratio matching, and denoising score matching.

Other methods, such as noise contrastive estimation, annealed importance sampling, bridge sampling, or a combination of these relying on the strengths of each, approximate the partition function directly, rather than the gradient of its logarithm.

We will not discuss any of these methods here. Instead we refer interested readers to Deep Learning by Ian Goodfellow et al. (2016).

Back to Hopfield nets and Boltzmann machines. These are the stepping stones to the backpropagation-trained deep neural networks that recent deterministic and probabilistic deep learning models rely on. These methods form the original connectionist (networks of neurons) approach to learning arbitrary probability distributions, initially only over binary vectors of zeros and ones, and later over vectors with arbitrary real number values.

Hopfield Nets

Hopfield nets take advantage of the elegant mathematics of statistical mechanics by identifying the states of neurons of an artificial neural network with the states of elements in a physical system. Even though Hopfield nets eventually proved to be computationally expensive and of limited practical use, they are the founding fathers of the modern era of neural networks, and are worth exploring if only to gauge the historical evolution of the AI field. Hopfield nets have no hidden units, and all their (visible) units are connected to each other. Each unit can be found in an on or off state (one or zero), and collectively they encode information about the whole network (or the system).
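A minimal Hopfield-style sketch, using the common ±1 convention for unit states (the chapter's 0/1 units map to ±1) and a Hebbian weight rule; the stored pattern is made up:

```python
import numpy as np

# One stored binary pattern, in the ±1 convention
pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])

# Hebbian storage: w_ij proportional to s_i s_j, no self-connections
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0)

def energy(s):
    # E(s) = -1/2 s^T W s; stored patterns sit at energy minima
    return -0.5 * s @ W @ s

# Start from a corrupted copy and update units one at a time
state = pattern.copy()
state[:2] *= -1  # flip two units
for _ in range(3):
    for i in range(len(state)):
        state[i] = 1 if W[i] @ state >= 0 else -1

print(np.array_equal(state, pattern))  # the net recalls the stored pattern
```

Each asynchronous update never increases the energy, which is why the dynamics settle into a stored pattern; this is the associative-memory behavior that made Hopfield nets historically important.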

Boltzmann Machine

A Boltzmann machine is a Hopfield net, but with the addition of hidden units. We are already familiar with the structure of input units and hidden units in neural networks, so no need to explain them, but this is where they started. Similar to the Hopfield net, both input and hidden units are binary, where the states are either 0 or 1 (modern versions implement units that take real number values, not only binary values).

All Boltzmann machines have an intractable partition function, so we approximate the maximum likelihood gradient using the techniques surveyed in the introduction of this section.

Boltzmann machines rely only on computationally intensive Gibbs sampling for their training. Gibbs is a name that appears repeatedly in the statistical mechanics field. Gibbs sampling provides unbiased estimates of the weights of the network, but these estimates have high variance. In general, there is a trade-off between bias and variance, and this trade-off highlights the advantages and disadvantages of the methods relying on each.

Restricted Boltzmann Machine (Explicit Density and Intractable)

Boltzmann machines have a very slow learning rate due to the many inter-connections within visible layers and within hidden layers (think a very messy backpropagation). This makes their training very slow and prohibits their application to practical problems. Restricted Boltzmann machines, which restrict connections only to those between different layers, solve this problem. That is, there are no connections within each layer of a restricted Boltzmann machine, allowing all of the units in each layer to be updated simultaneously. Therefore, for two connected layers, we can collect co-occurrence statistics by alternately updating all of the units in each layer. In practice, there are larger savings because of minimal sampling procedures, such as contrastive divergence.

Conditional independence

The lack of connections within each layer means that the states of all units in the hidden layer do not depend on each other, but they do depend on the states of units in the previous layer. In other words, given the states of the previous layer’s units, the state of each hidden unit is independent of the states of the other units in the hidden layer. This conditional independence allows us to factorize the joint probability of the state of a hidden layer p ( h | h previous ) as the product of the conditional probabilities of the states of individual hidden units. For example, if we have three units in a hidden layer, p ( h | h previous ) = p ( h 1 | h previous ) p ( h 2 | h previous ) p ( h 3 | h previous ) . The other way is also true, the states of the units of a previous layer are conditionally independent of each other given the states of the current layer. This conditional independence means that we can sample unit states instead of iteratively updating them for long periods of time.
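This block-sampling property can be sketched for a tiny restricted Boltzmann machine; the weights and biases below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative RBM with 6 visible and 3 hidden binary units
W = rng.normal(0, 1, (6, 3))   # visible-to-hidden weights
b = np.zeros(6)                # visible biases
c = np.zeros(3)                # hidden biases

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sample_hidden(v):
    # p(h | v) factorizes into p(h_1|v) p(h_2|v) p(h_3|v),
    # so all hidden units can be sampled at once
    p = sigmoid(v @ W + c)
    return (rng.random(3) < p).astype(int), p

def sample_visible(h):
    # The same factorization holds in the other direction: p(v | h)
    p = sigmoid(W @ h + b)
    return (rng.random(6) < p).astype(int), p

# One step of block Gibbs sampling: v -> h -> v'
v = rng.integers(0, 2, 6)
h, _ = sample_hidden(v)
v_new, _ = sample_visible(h)
print(v_new)
```

Without the within-layer restriction, each unit's conditional distribution would depend on the states of its layer-mates, and this one-shot sampling of a whole layer would not be possible.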

Universal approximation

The restricted connections in restricted Boltzmann machines allow for their stacking, that is, having a series of multiple hidden layers that are able to extract more complex features. We can see now how the architecture of the modern multilayer artificial neural network slowly emerged. Recall that in Chapter 4, we discussed the universal approximation of neural networks for a wide range of deterministic functions. In this chapter, we would like our networks to represent (or learn) joint probability distributions instead of deterministic functions. In 2008, Le Roux and Bengio proved that Boltzmann machines can approximate any discrete probability distribution to an arbitrary accuracy. This result also applies to restricted Boltzmann machines. Moreover, under certain mild conditions, each additional hidden layer increases the value of the log-likelihood function, thus allowing the model distribution to be closer to the true joint probability distribution of the training set.

In 2015, Eldan and Shamir showed that increasing the number of layers of a neural network can be exponentially more valuable than increasing the width of its layers, that is, the number of units in each layer (depth versus width). We also know from practice (without proofs) that it is possible to train a network with hundreds of hidden layers, where deeper layers represent higher-order features. Historically, the problem of vanishing gradients had to be overcome in order to train deep networks.

The Original Autoencoder

The autoencoder architecture aims to compress the information of the input into its lower-dimensional hidden layers. The hidden layers should retain the same amount of information as the input layers, even when they have fewer units than the input layers. We have discussed modern variational autoencoders, which provide an efficient method for training autoencoder networks. During training, each vector should be mapped to itself (unsupervised), and the network tries to learn the best encoding. The input and the output layers must then have the same number of units. A Boltzmann machine set up with a certain number of input units, a smaller number of hidden units, and an output layer with the same number of units as the input layer describes an original network autoencoder architecture. From a historical perspective this is significant: the autoencoder is one of the first examples of a network successfully learning a code, implicit in the states of the hidden units, to represent its inputs. This makes it possible to force a network to compress its input into a hidden layer with minimal loss of information. This is now an integral part of neural networks that we take for granted. The autoencoder architecture, with and without Boltzmann machines (with and without energy-based joint probability distributions), is still very influential in the deep learning world.

Earlier in this chapter, we discussed variational autoencoders. From a historical point of view, these synthesize the ideas of Boltzmann machine autoencoders, deep autoencoder networks, denoising autoencoders, and the information bottleneck (Tishby et al. 2000), which have their roots in the idea of analysis by synthesis (Selfridge 1958). Variational autoencoders use fast variational methods for their learning. In the context of bias-variance trade-off, variational methods provide biased estimates for the network’s weights that have low variance.

Probabilistic Language Modeling

A natural connection between this chapter and Chapter 7, which focused almost exclusively on natural language processing and the various ways to extract meaning from natural language data, is to survey the fundamentals behind probabilistic language models, then highlight the models from Chapter 7 that adhere to these fundamentals.

This chapter started with maximum likelihood estimation. One of the reasons this appears everywhere when we need to estimate probability distributions is that the probability distribution attained via maximum likelihood estimation is supported by mathematical theory, under a couple of conditions: maximum likelihood estimation does converge to the true distribution p data ( x ) that generated the data, in the limit as the number of data samples goes to infinity (that is, assuming we have a ton of data), and provided that the model probability distribution p model ( x , θ ) already includes the true probability distribution. That is, in the limit as the number of samples goes to infinity, the model parameters θ * that will maximize the likelihood of the data satisfy p model ( x , θ * ) = p data ( x ) .

In language models, the training data is samples of text from some corpus and/or genre, and we would like to learn its probability distribution so that we can generate similar text. It is important to keep in mind that the true data distribution is most likely not included in the family of distributions provided by p model ( x , θ ) , so the theoretical result in the previous paragraph might never hold in practice; however, this doesn’t deter us and we usually settle for models that are useful enough for our purposes. Our goal is to build a model that assigns probabilities to pieces of language. If we randomly assemble some pieces of language, we most likely end up with gibberish. What we actually want is to find the distribution of those sentences that mean something. A good language model is one that assigns high probabilities to sentences that are meaningful, even when these sentences are not among the training data. People usually compute the perplexity of a language model on the training data set to evaluate its performance.
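Perplexity is the exponential of the average negative log-probability per word; the lower it is, the less "surprised" the model is by the text. A sketch with hypothetical per-word probabilities (the numbers are made up):

```python
import numpy as np

# Hypothetical probabilities a model assigns to each word of two sentences
probs_plausible = np.array([0.2, 0.4, 0.3, 0.25])       # meaningful sentence
probs_gibberish = np.array([0.01, 0.02, 0.005, 0.01])   # random word salad

def perplexity(word_probs):
    # exp of the average negative log-probability per word;
    # the log also protects against underflow of tiny products
    return float(np.exp(-np.mean(np.log(word_probs))))

print(perplexity(probs_plausible) < perplexity(probs_gibberish))
```

A uniform model over a vocabulary of size N assigns every word probability 1/N and has perplexity exactly N, which gives a useful baseline for interpreting the number.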

Language models are based on the assumption that the probability distribution of the next word depends on the n-1 words that preceded it, for some fixed n, so we care about calculating p model ( x n | x 1 , x 2 , , x n-1 ) . If we are using a word2vec model that embeds the meaning of each word in a vector, then each of these x's is represented by a vector. Words that mean similar things or are frequently used in similar contexts tend to have similar vector values. We can use the transformer model from Chapter 7 to predict the next word vector based on the preceding word vectors.

Frequency-based language models construct conditional probability tables by counting the number of times words appear together in the training corpus. For example, we can estimate the conditional probability p(morning|good) of the word morning appearing after the word good, by counting the number of times good morning appears in the corpus divided by the number of times good appears in the corpus. That is:

p ( morning | good ) = p ( good , morning ) / p ( good )
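The count-based estimate can be sketched directly; the toy corpus is made up:

```python
from collections import Counter

# Toy corpus of whitespace-separated tokens (made up for illustration)
corpus = ("good morning everyone . good night . "
          "good morning again .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def cond_prob(word, given):
    # count(given followed by word) / count(given),
    # the frequency estimate of p(word | given)
    return bigrams[(given, word)] / unigrams[given]

print(cond_prob("morning", "good"))  # 2 of the 3 "good"s precede "morning"
```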

This breaks down for very large corpora or for unstructured text data such as tweets, Facebook comments, or SMS messages, where the usual rules of grammar, spelling, etc. are not totally adhered to.

We can formalize the notion of a probabilistic language model this way:

  1. Specify the vocabulary V of your language. This could be a set of characters, spaces, punctuation, symbols, unique words, and/or n-grams. Mathematically, it is a finite discrete set that includes a stopping symbol signifying the end of a thought or sentence, like a period in English (even though a period does not always mean the end of a sentence in English, such as when it is used for abbreviations).

  2. Define a sentence (which could be meaningful or not) as a finite sequence of symbols x = ( x 1 , x 2 , … , x m ) from the vocabulary V ending in the stop symbol. Each x i can assume any value from the vocabulary V. We can specify m as a maximum length for our sentences.

  3. Define our language space l m = { ( x 1 , x 2 , … , x m ) , x i ∈ V } as the set of all sentences of length less than or equal to m. The overwhelming majority of these sentences will mean nothing, and we need to define a language model that only captures the sentences that mean something: high probabilities for meaningful sentences and low probabilities for nonmeaningful ones.

  4. Let 𝒜 be the collection of all subsets of l m . This accounts for collections of all meaningful and meaningless sentences of maximal length m.

  5. In rigorous probability theory, we usually start with probability triples: a space, a sigma algebra containing some subsets of that space, and a probability measure assigned to each member of the chosen sigma algebra (do not worry about these details in this chapter). A language model, in this context, is the probability triple: the language space l m , the sigma algebra 𝒜 made up of all the subsets of the language space, and a probability measure P that we need to assign to each member of 𝒜 . Since our language space is discrete and finite, it is easier to assign a probability p to each member of l m instead, that is, a probability to each sentence x = ( x 1 , x 2 , … , x m ) (since this will in turn induce a probability measure P on the collection 𝒜 of all subsets, we will never worry about this for language models). It is this p that we need to learn from the training data. The usual approach is to select a full family of probability distributions p ( x ; θ ) parameterized by θ .

  6. Finally, we need to estimate the parameter θ by maximizing the likelihood of the training data set that contains many sentence samples from l m . Since the probabilities of meaningful sentences are very small numbers, we use the logarithm of these probabilities instead to avoid the risk of underflow.

For consistency, it is a nice exercise to check log-linear models and log-bilinear models (GloVe) and latent Dirichlet allocation from Chapter 7 in the context of this section.

Summary and Looking Ahead

This was another foundational chapter in our journey to pinpoint the mathematics that is required for the state-of-the-art AI models. We shifted from learning deterministic functions in earlier chapters to learning joint probability distributions of data features. The goal is to use those to generate new data similar to the training data. We learned, still without formalizing, a lot of properties and rules for probability distributions. We surveyed the most relevant models, along with some historical evolution that led us here. We made the distinction between models that provide explicit formulas for their joint distributions and models that interact indirectly with the underlying distribution, without explicitly writing down formulas. For models with explicit formulas, computing the log-likelihood and its gradients can be tractable or intractable, each of which requires its own methods. The goal is always the same: capture the underlying true joint probability distribution of the data by finding a model that maximizes its log-likelihood.

None of this would have been necessary if our data was low dimensional, with one or two features. Histograms and kernel density estimators do a good job of estimating probability distributions for low-dimensional data. One of the best accomplishments in machine learning is the ability to model high-dimensional joint probability distributions from a big volume of data.

All of the approaches that we presented in this chapter have their pros and cons. For example, variational autoencoders allow us to perform both learning and efficient Bayesian inference in probabilistic graphical models with hidden (latent) variables. However, they generate lower-quality samples. Generative adversarial networks generate better samples, but they are more difficult to optimize due to their unstable training dynamics. They search for an unstable saddle point instead of a stable maximum or minimum. Deep belief networks such as PixelCNN and WaveNet have a stable training process, optimizing the softmax loss function. However, they are inefficient during sampling and don’t organically embed data into lower dimensions, as autoencoders do.

Two-player zero-sum games from game theory appeared naturally in this chapter due to the setup of generative adversarial networks.

Looking ahead into the next chapter on graphical modeling, we note that the connections in the graph of a neural network dictate the way we can write conditional probabilities, easily pointing out the various dependencies and conditional independences. We saw this while discussing restricted Boltzmann machines in this chapter. In the next chapter, we focus exclusively on graphical modeling, which we have managed to avoid for a good three-quarters of the book.

Chapter 9. Graph Models

Now this is something we all want to learn.

H.

Graphs, diagrams, and networks are all around us: cities and roadmaps, airports and connecting flights, electrical networks, the power grid, the World Wide Web, molecular networks, biological networks such as our nervous system, social networks, terrorist organization networks, schematic representations of mathematical models, artificial neural networks, and many, many others. They are easily recognizable, with distinct nodes representing some entities that we care for, which are then connected by directed or undirected edges indicating the presence of some relationship between the connected nodes.

Data that has a natural graph structure is better understood by a mechanism that exploits and preserves that structure, building functions that operate directly on graphs (however they are mathematically represented), as opposed to feeding graph data into machine learning models that artificially reshape it before analyzing it. This inevitably leads to loss of valuable information. This is the same reason convolutional neural networks are successful with image data, recurrent neural networks are successful with sequential data, and so on.

Graph-based models are very attractive for data scientists and engineers. Graph structures offer a flexibility that is not afforded in spaces with a fixed underlying coordinate system, such as in Euclidean spaces or in relational databases, where the data along with its features is forced to adhere to a rigid and predetermined form. Moreover, graphs are the natural setting that allows us to investigate the relationships between the points in a data set. So far, our machine learning models consumed data represented as isolated data points. Graph models, on the other hand, consume isolated data points, along with the connections between them, allowing for deeper understanding and more expressive models.

The human brain naturally internalizes graphical structures: it is able to model entities and their connections. It is also flexible enough to generate new networks, or expand and enhance existing ones, for example, when city planning, project planning, or when continuously updating transit networks. Moreover, humans can transition from natural language text to graph models and vice versa seamlessly. When we read something new, we find it natural to formulate a graphical representation to better comprehend it or illustrate it to other people. Conversely, when we see graph schematics, we are able to describe them via natural language. There are currently models that generate natural language text based on knowledge graphs and vice versa. This is called reasoning over knowledge graphs.

At this point we are pretty comfortable with the building blocks of neural networks, along with the types of data and tasks they are usually suited for:

The main tasks are mostly classification, regression, clustering, coding and decoding, or new data generation, where the model learns the joint probability distribution of the data features.

We are also familiar with the fact that we can mix and match some of the components of neural networks to construct new models that are geared toward specific tasks. The good news is that graph neural networks use the exact same ingredients, so we do not need to go over any new machine learning concepts in this chapter. Once we understand how to mathematically represent graph data along with its features in a way that can be fed into a neural network, either for analysis or for new network (graph) data generation, we are good to go. We will therefore avoid going down a maze of surveys for all the graph neural networks out there. Instead, we will focus on the simple mathematical formulation, popular applications, common tasks for graph models, available data sets, and model evaluation methods. Our goal is to develop a strong intuition for the workings of the subject. The main challenge is, yet again, lowering the dimensionality of the problem in a way that makes it amenable to computation and analysis, while preserving as much information as possible. In other words, for a network with millions of users, we cannot expect our models to take as input vectors or matrices with millions of dimensions. We need efficient representation methods for graph data.

If you want to dive deeper and fast track into graph neural networks, the survey paper “A Comprehensive Survey on Graph Neural Networks” (Wu et al. 2019) is an excellent place to start (of course, only after carefully reading this chapter).

Graphs: Nodes, Edges, and Features for Each

Graphs are naturally well suited to model any problem where the goal is to understand a discrete collection of objects (with emphasis on discrete and not continuous) through the relationships among them. Graph theory is a relatively young discipline in discrete mathematics and computer science with virtually unlimited applications. This field is in need of more brains to tackle its many unsolved problems.

A graph (see Figure 9-1) is made up of:

Nodes or vertices

Bundled together in a set as Nodes = {node_1, node_2, …, node_n}. This can be as few as a handful of nodes (or even one node), or as massive as billions of nodes.

Edges

Connecting any two nodes (this can include an edge from a node to itself, or multiple edges connecting the same two nodes) in a directed (pointing from one node to the other) or undirected way (the edge has no direction from either node to the other). The set of edges is Edges = {edge_ij = (node_i, node_j), such that there is an edge pointing from node_i to node_j}.

Node features

We can assign to each node_i a list of, say, d features (such as the age, gender, and income level of a social media user) bundled together in a vector features_node_i. We can then bundle all the feature vectors of all the n nodes of the graph in a matrix Features_Nodes of size d × n.

Edge features

Similarly, we can assign to each edge_ij a list of, say, c features (such as the length of a road, its speed limit, and whether it is a toll road or not) bundled together in a vector features_edge_ij. We can then bundle all the feature vectors of all the m edges of the graph in a matrix Features_Edges of size c × m.
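As a toy illustration of these feature matrices, the sketch below bundles hypothetical node features and edge features (the numbers are made up for a three-user social network) into the d × n and c × m matrices just described:

```python
import numpy as np

# Toy social-network graph: n = 3 nodes, m = 2 undirected edges.
# Hypothetical node features (d = 3): age, gender code, income level.
features_node = {
    0: [34, 0, 55_000],
    1: [28, 1, 48_000],
    2: [45, 1, 90_000],
}
# Hypothetical edge features (c = 2): years connected, interaction count.
features_edge = {
    (0, 1): [5, 120],
    (1, 2): [2, 30],
}

# Bundle into a d x n node-feature matrix and a c x m edge-feature matrix,
# one column per node / edge, as in the text.
Features_Nodes = np.array([features_node[i] for i in sorted(features_node)]).T
Features_Edges = np.array([features_edge[e] for e in sorted(features_edge)]).T

print(Features_Nodes.shape)  # (d, n) = (3, 3)
print(Features_Edges.shape)  # (c, m) = (2, 2)
```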

Figure 9-1. A graph is made up of nodes and directed or undirected edges connecting the nodes

Graph models are powerful because they are flexible and not necessarily forced to adhere to a rigid grid-like structure. We can think of their nodes as floating through space with no coordinates whatsoever. They are only held together by the edges that connect them. However, we need a way to represent their intrinsic structure. There are software packages that visualize graphs given their sets of nodes and edges, but we cannot do analysis and computations on these pretty (and informative) pictures. There are two popular graph representations that we can use as inputs to machine learning models: a graph’s adjacency matrix and its incidence matrix.

There are other representations that are useful for graph theoretic algorithms, such as edge listing, two linear arrays, and successor listing. All of these representations convey the same information but differ in their storage requirements and the efficiency of graph retrieval, search, and manipulation. Most graph neural networks take as input the adjacency matrix along with the feature matrices for the nodes and the edges. Many times, they must do a dimension reduction (called graph representation or graph embedding) before feeding the graph data into a model. Other times, the dimension reduction step is part of the model itself.

Adjacency matrix

One algebraic way to store the structure of a graph on a machine and study its properties is through an adjacency matrix, which is an n × n matrix whose entries adjacency_ij = 1 if there is an edge from node_i to node_j, and adjacency_ij = 0 if there is no edge from node_i to node_j. Note that this definition is able to accommodate a self edge, which is an edge from a vertex to itself, but not multiple edges between two distinct nodes, unless we decide to include the numbers 2, 3, etc. as entries in the adjacency matrix. This, however, can mess up some results that graph theorists have established using the adjacency matrix.
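Building the adjacency matrix from an edge list is a one-loop affair. A minimal sketch, using the directed edges of the four-page web of Figure 9-3 (pages A through D numbered 0 through 3):

```python
import numpy as np

def adjacency_matrix(n, edges):
    """n x n adjacency matrix of a directed graph:
    A[i, j] = 1 iff there is an edge from node i to node j."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
    return A

# Directed edges of the four-page web (A=0, B=1, C=2, D=3):
# A links to B, C, D; B links to A and D; C links to D; D links to B and C.
edges = [(0, 1), (0, 2), (0, 3), (1, 0), (1, 3), (2, 3), (3, 1), (3, 2)]
A = adjacency_matrix(4, edges)
print(A)
```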

Incidence matrix

This is another algebraic way to store the structure of the graph and retain its full information. Here, we list both the nodes and the edges, then formulate a matrix whose rows correspond to the vertices and whose columns correspond to the edges. An entry incidence_ij of the matrix is 1 if edge_j connects node_i to some other node, and zero otherwise. Note that this definition is able to accommodate multiple edges between two distinct nodes, but not a self edge from a node to itself. Since many graphs have more edges than vertices, this matrix tends to be very wide and larger in size than the adjacency matrix.
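The incidence matrix is just as easy to assemble. A quick sketch on a small undirected graph (a triangle plus one pendant node, an illustrative choice), confirming that each column has exactly two 1s and each row sums to the node's degree:

```python
import numpy as np

def incidence_matrix(n, edges):
    """n x m incidence matrix of an undirected graph: entry (i, j) is 1
    if edge j is incident to node i, and 0 otherwise."""
    M = np.zeros((n, len(edges)), dtype=int)
    for j, (u, v) in enumerate(edges):
        M[u, j] = 1
        M[v, j] = 1
    return M

# Undirected triangle (nodes 0, 1, 2) plus a pendant node 3: 4 nodes, 4 edges
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
M = incidence_matrix(4, edges)
print(M)
# Each column has exactly two 1s (its edge's two endpoints),
# and each row sums to the degree of that node.
print(M.sum(axis=0))  # [2 2 2 2]
print(M.sum(axis=1))  # degrees: [2 2 3 1]
```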

The Laplacian matrix is another matrix that is associated with an undirected graph. It is an n × n symmetric matrix where each node has a corresponding row and column. The diagonal entries of the Laplacian matrix are equal to the degree of each node, and the off-diagonal entries are zero if there is no edge between nodes corresponding to that entry, and -1 if there is an edge between them. This is the discrete analog of the continuous Laplace operator from calculus and partial differential equations where the discretization happens at the nodes of the graph.
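Given the adjacency matrix of an undirected graph, the Laplacian is simply the degree matrix minus the adjacency matrix. A minimal sketch on a small undirected graph (a triangle plus one pendant node, chosen for illustration):

```python
import numpy as np

# Adjacency matrix of an undirected graph on 4 nodes
# (a triangle on nodes 0, 1, 2, plus pendant node 3 attached to node 2)
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# Laplacian: node degrees on the diagonal, -1 wherever an edge exists
D = np.diag(A.sum(axis=1))
L = D - A
print(L)

# Every row sums to zero, mirroring the fact that the continuous Laplacian
# of a constant function vanishes: the constant vector is in the null space.
print(L @ np.ones(4))  # [0. 0. 0. 0.]
```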

The Laplacian matrix takes into account the second derivatives of a continuous (and twice differentiable) function, which measure the concavity of a function, or how much its value at a point differs from its value at the surrounding points. Similar to the continuous Laplace operator, the Laplacian matrix provides a measure of the extent to which a function defined on the graph differs at one node from its values at nearby nodes. The Laplacian matrix of a graph appears when we investigate random walks on graphs and when we study electrical networks and resistances. We will see these later in this chapter.

We can easily infer simple node and edge statistics from the adjacency and incidence matrices, such as the degrees of nodes (the degree of a node is the number of edges connected to this node). The degree distribution P(k) reflects the variability in the degrees of all the nodes. P(k) is the empirical probability that a node has exactly k edges. This is of interest for many networks, such as web connectivity and biological networks.
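The empirical degree distribution P(k) falls straight out of the adjacency matrix. A minimal sketch for a small undirected graph (the same triangle-plus-pendant-node shape used here purely for illustration):

```python
import numpy as np

def degree_distribution(A):
    """Empirical P(k): fraction of nodes with exactly k edges,
    for an undirected graph given by its adjacency matrix."""
    degrees = A.sum(axis=1)            # degree of each node
    counts = np.bincount(degrees)      # how many nodes have degree k
    return counts / counts.sum()       # normalize to probabilities

A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
P = degree_distribution(A)
print(P)  # P(k) for k = 0, 1, 2, 3
```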

For example, if the distribution of nodes of degree k in a graph follows a power law of the form P(k) = k^(-α), then such graphs have few nodes of high connectivity, or hubs, which are central to the network topology, holding it together, along with many nodes with low connectivity, which connect to the hubs.

We can also add time dependency, and think of dynamic graphs whose properties change as time evolves. Currently, there are models that add time dependency to the node and/or edge feature vectors (so each entry of these vectors becomes time dependent). For example, for a GPS system that predicts travel routes, the edge features connecting one point on the map to another change with time depending on the traffic situation.

Now that we have a mathematical framework for graph objects, along with their node and edge features, we can feed these representative vectors and matrices (and labels for supervised models) into machine learning models and do business as usual. Most of the time, half of the story is having a good representation for the objects at hand. The other half of the story is the expressive power of machine learning models in general, where we can get good results without encoding (or even having to learn) the rules that lead to these results. For the purposes of this chapter, this means that we can jump straight into graph neural networks before learning proper graph theory.

Directed Graphs

For directed graphs, on the one hand, we are interested in the same properties as undirected graphs, such as their spanning trees, fundamental circuits, cut sets, planarity, thickness, and others. On the other hand, directed graphs have their own unique properties that are different than undirected graphs, such as strong connectedness, arborescence (a directed form of rooted tree), decyclization, and others.

Example: PageRank Algorithm

PageRank is a retired algorithm (its patent expired in 2019) that Google used to rank web pages in its search engine results. It provides a measure of the importance of a web page based on how many other pages link to it. In graph language, the nodes are the web pages, and the directed edges are the links pointing from one page to another. According to PageRank, a node is important when it has many other web pages pointing to it, that is, when its incoming degree is large (see Figure 9-2).

Figure 9-2. PageRank assigns higher scores to pages that have more pages pointing (or linking) to them (image source)

As a concrete example involving graphs, adjacency matrix, linear algebra, and the web, let’s walk through the PageRank algorithm for an absurdly simplified World Wide Web consisting of only four indexed web pages, such as in Figure 9-3, as opposed to billions.

Figure 9-3. A make-believe World Wide Web consisting of only four indexed web pages (adapted from Coursera: Mathematics for Machine Learning)

In the graph of Figure 9-3, only B links to A; A and D link to B; A and D link to C; A, B, and C link to D; A links to B, C, and D; B links to A and D; C links to D; and D links to B and C.

Let’s think of a web surfer who starts at some page then randomly clicks on a link from that page, then a link from this new page, and so on. This surfer simulates a random walk on the graph of the web.

In general, on the graph representing the World Wide Web, such a random surfer traverses the graph from a certain node to one of its neighbors (or back to itself if there are links pointing back to the page). We will encounter the World Wide Web one more time in this chapter and explore the kind of questions that we like to understand about the nature of its graph. We need a matrix for the random walk, which for this application we call the linking matrix, but in reality it is the adjacency matrix weighted by the degree of each vertex. We use this random walk matrix, or linking matrix, to understand the long-term behavior of the random walk on the graph. Random walks on graphs will appear throughout this chapter.

Back to the four-page World Wide Web of Figure 9-3. If the web surfer is at page A, there is a one-third chance the surfer will move to page B, one-third chance to move to C, and one-third chance to move to D. Thus, the outward linking vector of page A is:

linking_A = (0, 1/3, 1/3, 1/3)^T

If the web surfer is at page B, there is a one-half chance they will move to page A and a one-half chance they will move to page D. Thus, the outward linking vector of page B is:

linking_B = (1/2, 0, 0, 1/2)^T

Similarly, the outward linking vectors of pages C and D are:

linking_C = (0, 0, 0, 1)^T        linking_D = (0, 1/2, 1/2, 0)^T

We bundle the linking vectors of all the web pages together to create a linking matrix:

Linking =
  | 0    1/2   0   0   |
  | 1/3  0     0   1/2 |
  | 1/3  0     0   1/2 |
  | 1/3  1/2   1   0   |

Note that the columns of the linking matrix are the outward linking probabilities, and the rows are the inward linking probabilities. How can a surfer end up at page A? They can only arrive from page B, and from there, there is only a 0.5 probability that they will end up at A.

Now we can rank page A by adding up the ranks of all the pages pointing to A, each weighted by the probability a surfer will end up at page A from that page; that is, a page with many highly ranked pages pointing to it will also rank high. The ranks of all four pages are therefore:

rank_A = 0 · rank_A + 1/2 · rank_B + 0 · rank_C + 0 · rank_D
rank_B = 1/3 · rank_A + 0 · rank_B + 0 · rank_C + 1/2 · rank_D
rank_C = 1/3 · rank_A + 0 · rank_B + 0 · rank_C + 1/2 · rank_D
rank_D = 1/3 · rank_A + 1/2 · rank_B + 1 · rank_C + 0 · rank_D

To find the numerical value for the rank of each web page, we have to solve that system of linear equations, which is the territory of linear algebra. In matrix vector notation, we write the system as:

ranks = Linking · ranks

Therefore, the vector containing all the ranks of all the web pages is an eigenvector of the linking matrix of the graph of the web pages (where the nodes are the web pages, and the directed edges are the links between them) with eigenvalue 1. Recall that in reality, the graph of the web is enormous, which means that the linking matrix is enormous, and devising efficient ways to find its eigenvectors becomes of immediate interest.

Computing eigenvectors and eigenvalues of a given matrix is one of the most important contributions of numerical linear algebra, with immediate applications in many fields. A lot of the numerical methods for finding eigenvectors and eigenvalues involve repeatedly multiplying a matrix with a vector. When dealing with huge matrices, this is expensive, and we have to use every trick in the book to make the operations cheaper. We take advantage of the sparsity of the matrix (many entries are zeros, so it is a waste to multiply with these entries and then discover that they are just zeros); we introduce randomization or stochasticity, and venture into the fields of high-dimensional probability and large random matrices (we will get a flavor of these in Chapter 11 on probability). For now, we reemphasize the iterative method we introduced in Chapter 6 on singular value decompositions: we start with a random vector ranks_0, then produce a sequence of vectors iteratively by multiplying by the linking matrix:

ranks_{i+1} = Linking · ranks_i

For our four-page World Wide Web, this converges to the vector:

ranks = (0.12, 0.24, 0.24, 0.4)^T

which means that page D is ranked highest, and in a search engine query with similar content it will be the first page returned. We can then redraw the diagram in Figure 9-3 with the size of each circle corresponding to the importance of the page.
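The iteration above is easy to reproduce. The sketch below runs the power method on the four-page linking matrix and recovers the rank vector (0.12, 0.24, 0.24, 0.4):

```python
import numpy as np

# Linking matrix of the four-page web: column j holds the outward
# linking probabilities of page j (order: A, B, C, D)
Linking = np.array([
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 1, 0  ],
])

# Power iteration: repeatedly multiply a random rank vector by Linking
rng = np.random.default_rng(7)
ranks = rng.random(4)
ranks /= ranks.sum()          # normalize so the ranks sum to 1
for _ in range(100):
    ranks = Linking @ ranks

# Converges to (0.12, 0.24, 0.24, 0.4): page D ranks highest
print(np.round(ranks, 2))
```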

When the PageRank algorithm was in use, the real implementation included a damping factor d, a number between 0 and 1, usually around 0.85, which takes into account only an 85% chance that the web surfer clicks on a link from the page they are currently at, and a 15% chance that they start at a completely new page that has no links from the page they are currently at. This modifies the iterative process to find the rankings of the pages of the web in a straightforward way:

ranks_{i+1} = d · Linking · ranks_i + ((1 - d) / total number of pages) · ones
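A sketch of the damped iteration with d = 0.85 on the four-page linking matrix (the teleportation term spreads the remaining 15% of the rank uniformly over all pages):

```python
import numpy as np

# Column-stochastic linking matrix of the four-page web (order: A, B, C, D)
Linking = np.array([
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 1, 0  ],
])

d = 0.85                      # damping factor
n = Linking.shape[0]          # total number of pages
ranks = np.full(n, 1 / n)     # start from a uniform ranking

# Damped update: follow a link with probability d,
# jump to a uniformly random page with probability 1 - d
for _ in range(100):
    ranks = d * (Linking @ ranks) + (1 - d) / n * np.ones(n)

print(np.round(ranks, 3))  # page D still ranks highest
```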

Finally, you might be wondering whether Google keeps searching the web for new web pages and indexing them, and does it keep checking all indexed web pages for new links? The answer is yes, and the following excerpts are from Google’s “In-Depth Guide to How Google Search Works”:

Google Search is a fully-automated search engine that uses software known as web crawlers that explore the web regularly to find pages to add to our index. In fact, the vast majority of pages listed in our results aren’t manually submitted for inclusion, but are found and added automatically when our web crawlers explore the web. […​] There isn’t a central registry of all web pages, so Google must constantly look for new and updated pages and add them to its list of known pages. This process is called “URL discovery”. Some pages are known because Google has already visited them. Other pages are discovered when Google follows a link from a known page to a new page: for example, a hub page, such as a category page, links to a new blog post. Still other pages are discovered when you submit a list of pages (a sitemap) for Google to crawl. […​] When a user enters a query, our machines search the index for matching pages and return the results we believe are the highest quality and most relevant to the user. Relevancy is determined by hundreds of factors, which could include information such as the user’s location, language, and device (desktop or phone). For example, searching for “bicycle repair shops” would show different results to a user in Paris than it would to a user in Hong Kong.

The more data we collect, the more complex searching becomes. Google rolled out RankBrain in 2015. It uses machine learning to vectorize the text on web pages, similar to what we did in Chapter 7. This process adds context and meaning to the indexed pages, so that the search returns more accurate results. The downside of this process is the much higher dimensionality associated with meaning vectors. To circumvent the difficulty of checking every vector in every dimension before returning the web pages closest to the query, Google uses an approximate nearest neighbors algorithm, which helps return excellent results in milliseconds: the experience we have now.

Inverting Matrices Using Graphs

Many problems in the applied sciences involve writing a discrete linear system A x = b and solving it, which is equivalent to inverting the matrix A and finding the solution x = A^(-1) b. But for large matrices, this is a computationally expensive operation, along with having high storage requirements and poor accuracy. We are always looking for efficient ways to invert matrices, sometimes leveraging the special characteristics of the particular matrices at hand.
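In practice we rarely form A^(-1) explicitly: a factorization-based solve is cheaper and numerically more accurate. A quick sketch with a random, deliberately well-conditioned 100 × 100 test matrix (an illustrative construction, not a special structure):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Adding n * I to a random matrix makes it comfortably well conditioned
A = rng.normal(size=(n, n)) + n * np.eye(n)
b = rng.normal(size=n)

# Preferred: solve A x = b via an LU factorization, without forming A^(-1)
x_solve = np.linalg.solve(A, b)
# For comparison only: the explicit inverse (more work, less accurate in general)
x_inv = np.linalg.inv(A) @ b

print(np.allclose(A @ x_solve, b))  # True: x_solve really solves the system
print(np.allclose(x_solve, x_inv))  # True: same answer, different cost
```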

The following is a graph theoretic method that computes the inverse of a matrix of a decent size (for example, a hundred rows and a hundred columns):

  1. Replace each nonzero entry in matrix A with a 1. We obtain a binary matrix.

  2. Permute the rows and the corresponding columns of the resulting binary matrix to make all diagonal entries 1’s.

  3. We think of the matrix obtained as the adjacency matrix of a directed graph (where we delete the self-loops corresponding to 1’s along the diagonal from the graph).

  4. The resulting directed graph is partitioned into its fragments.

  5. If a fragment is too large, we tear it into smaller fragments by removing an appropriate edge.

  6. We invert the smaller matrices.

  7. Apparently this leads to the inverse of the original matrix.

We will not explain why and how, but this method is so cute, so it made its way into this chapter.

Cayley Graphs of Groups: Pure Algebra and Parallel Computing

Graphs of groups, also called Cayley graphs or Cayley diagrams, can be helpful in designing and analyzing network architectures for parallel computers, routing problems, and routing algorithms for interconnected networks. The paper “Processor Interconnection Networks from Cayley Graphs” (Schibell et al. 2011) is an interesting and easy read on earlier designs applying Cayley graphs for parallel computing networks, and explains how to construct Cayley graphs that meet specific design parameters. Cayley graphs have also been applied for classification of data.

We can represent every group with n elements as a connected directed graph of n nodes, where each node corresponds to an element from the group, and each edge represents a multiplication by a generator from the group. The edges are labeled (or colored) depending on which generator from the group we are multiplying by (see Figure 9-4). This directed graph uniquely defines the group: each product of elements in the group corresponds to following a sequence of directed edges on the graph. For example, the graph of a cyclic group of n elements is a directed circuit of n nodes in which every edge represents multiplication by one generator of the group.
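The cyclic-group example is small enough to generate directly. A minimal sketch (the helper `cayley_graph` is a hypothetical utility, not a standard library function) that lists the generator-labeled edges for Z_6 under addition mod 6 with the single generator 1, producing the directed 6-cycle described above:

```python
def cayley_graph(elements, generators, multiply):
    """Directed, generator-labeled edges of a Cayley graph:
    one edge g -> g*s for every element g and every generator s."""
    return [(g, multiply(g, s), s) for g in elements for s in generators]

# Cyclic group Z_6 under addition mod 6, with the single generator 1:
# the Cayley graph is a directed circuit on 6 nodes.
n = 6
edges = cayley_graph(range(n), [1], lambda g, s: (g + s) % n)
print(edges)  # [(0, 1, 1), (1, 2, 1), ..., (5, 0, 1)]
```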

From a pure math perspective, Cayley graphs are useful for visualizing and studying abstract groups, encoding their full abstract structure and all of their elements in a visual diagram. The symmetry of Cayley graphs makes them useful for constructing more involved abstract objects. These are central tools for combinatorial and geometric group theory. For more on Cayley graphs, check out this Wolfram Mathworld page.

Figure 9-4. The Cayley graph of the free group on two generators a and b; each node represents an element of the free group, and each edge represents multiplication by a or b (image source)

Message Passing Within a Graph

The message passing framework is a useful approach for modeling the propagation of information within graphs, as well as neatly aggregating the information conveyed in the nodes, edges, and the structure of the graph itself into vectors of a certain desired dimension. Within this framework, we update every node with information from the feature vectors of its neighboring nodes and the edges connected to them. A graph neural network performs multiple rounds of message passing; each round propagates a single node’s information further. Finally, we combine the latent features of each individual node to obtain its unified vector representation and represent the whole graph.

More concretely, for a specific node, we choose a function that takes as input the node’s feature vector, the feature vector of one of its neighboring nodes (those connected to it by an edge), and the feature vector of the edge that connects it to this neighboring node, and outputs a new vector that contains within it information from the node, the neighbor, and their connecting edge. We apply the same function to all of the node’s neighbors, then add the resulting vectors together, producing a message vector. Finally, we update the feature vector of our node by combining its original feature vector with the message vector within an update function that we also choose. When we do this for each node in the graph, each node’s new feature vector will contain information from itself, all its neighbors, and all its connecting edges. Now, when we repeat this process one more time, the node’s most recent feature vector will contain information from itself, all its neighbors and its neighbors’ neighbors, and all the corresponding connecting edges. Thus, the more message passing rounds we do, the more each node’s feature vector contains information from farther nodes within the graph, moving one edge separation at a time. The information diffuses successively across the entire network.
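The following is a minimal sketch of one such round in plain Python. The message and update functions here are illustrative choices (sum the neighbor’s features with the edge’s features, then average with the node’s old features), not the only possibilities:

```python
# Minimal sketch of one message passing round on a small undirected graph.
# Illustrative choices: the message function sums a neighbor's feature vector
# with the connecting edge's feature vector; the update function averages the
# node's old features with the aggregated message.
def message_passing_round(node_feats, edges, edge_feats):
    new_feats = {}
    for v, x_v in node_feats.items():
        # aggregate messages from all of v's neighbors
        message = [0.0] * len(x_v)
        for (a, b), e in zip(edges, edge_feats):
            if v in (a, b):
                u = b if v == a else a  # the neighbor across this edge
                for k in range(len(x_v)):
                    message[k] += node_feats[u][k] + e[k]
        # update: combine the old feature vector with the message vector
        new_feats[v] = [(x_v[k] + message[k]) / 2 for k in range(len(x_v))]
    return new_feats

# Triangle graph with 2-dimensional node and edge features
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 1), (1, 2), (0, 2)]
edge_feats = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
print(message_passing_round(feats, edges, edge_feats))
```

After one round, every node’s new vector already mixes information from itself, both neighbors, and both incident edges; repeating the round would propagate information one edge farther.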

The Limitless Applications of Graphs

Applications for graph neural networks, and graph models in general, are ubiquitous and so important that I am a bit regretful I did not start the book with graphs. In any graph model, we start by answering the following questions:

  • What are the nodes?

  • What is the relationship that links two nodes, that establishes directed or undirected edge(s) between them?

  • Should the model include feature vectors for the nodes and/or the edges?

  • Is our model dynamic, where the nodes, edges, and their features evolve with time, or is it static in time?

  • What are we interested in? Classifying (for example cancerous or noncancerous; fake news spreader or real news spreader)? Generating new graphs (for example for drug discovery)? Clustering? Embedding the graph into a lower-dimensional and structured space?

  • What kind of data is available or needed, and is the data organized and/or labeled? Does it need preprocessing?

We survey a few applications in this section, but there are many more that genuinely lend themselves to a graph modeling structure. It is good to read the abstracts of the linked publications since they help capture common themes and ways of thinking about these models. The following list gives a good idea of the common tasks for graph neural networks:

  • Node classification

  • Graph classification

  • Clustering and community detection

  • New graph generation

  • Influence maximization

  • Link prediction

Image Data as Graphs

We might encounter graph neural networks tested on the MNIST data set for handwritten digits, which is one of the benchmark sets for computer vision. If you wonder how image data (stored as three-dimensional tensors of pixel intensities across each channel) manages to fit into a graph structure, here’s how it works. Each pixel is a node, and its features are the respective intensities of its three channels (if it is a color image, otherwise it only has one feature). The edges connect each pixel to the three, five, or eight pixels surrounding it, depending on whether the pixel is located at a corner, edge, or in the middle of the image.
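A sketch of this conversion for a grayscale image (one channel, so one feature per node; the 8-neighborhood and the undirected-edge representation are the assumptions described above):

```python
# Sketch: turn a grayscale image (2D grid of pixel intensities) into a graph.
# Each pixel becomes a node whose feature is its intensity; edges connect each
# pixel to the up-to-8 pixels surrounding it (fewer at corners and borders).
def image_to_graph(image):
    rows, cols = len(image), len(image[0])
    features = {(r, c): image[r][c] for r in range(rows) for c in range(cols)}
    edges = set()
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= nr < rows and 0 <= nc < cols:
                        # store each undirected edge once, as a sorted pair
                        edges.add(tuple(sorted([(r, c), (nr, nc)])))
    return features, edges

features, edges = image_to_graph([[0, 50], [100, 150]])
degree = {v: sum(v in e for e in edges) for v in features}
print(degree)  # every pixel of a 2x2 image touches the other three
```

On a larger image the corner pixels have degree 3, border pixels 5, and interior pixels 8, matching the description above.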

Brain Networks

One of the main pursuits in neuroscience is understanding the network organization of the brain. Graph models provide a natural framework and many tools for analyzing the complex networks of the brain, both in terms of their anatomy and their functionality.

To create artificial intelligence on par with human intelligence, we must understand the human brain on many levels. One aspect is the brain’s network connectivity, how connectivity affects the brain’s functionality, and how to replicate that, building up from small computational units to modular components to a fully independent and functional system.

Human brain anatomical networks demonstrate short path length (conservation of wiring costs) along with high-degree cortical hubs, that is, high clustering. This is on both the cellular scale and on the whole brain scale. In other words, the brain network seems to have organized itself in a way that maximizes the efficiency of information transfer and minimizes connection cost. The network also demonstrates modular and hierarchical topological structures and functionalities. The topological structures of the brain networks and their functionalities are interdependent over both short and long time scales. The dynamic properties of the networks are affected by their structural connectivity, and over a longer time scale, the dynamics affect the topological structure of the network.

The most important questions are: what is the relationship between the network properties of the brain and its cognitive behavior? What is the relationship between the network properties and brain and mental disorders? For example, we can view neuropsychiatric disorders as disconnectivity syndromes, where graph theory can help quantify weaknesses, vulnerability to lesions, and abnormalities in the network structures. In fact, graph theory has been applied to study the structural and functional network properties in schizophrenia, Alzheimer’s, and other disorders.

Spread of Disease

As we have all learned from the COVID-19 pandemic, it is of crucial importance to be able to forecast disease incidents accurately and reliably for mitigation purposes, quarantine measures, policy, and many other decision factors. A graph model can consider either individuals or entire geographic blocks as nodes, and contact occurrences between these individuals or blocks as edges. Recent models for predicting COVID-19 spread, for example, the article “Combining Graph Neural Networks and Spatio-temporal Disease Models to Improve the Prediction of Weekly COVID-19 Cases in Germany” (Fritz et al. 2022), incorporate human mobility data from Facebook, Apple, and Google to model interactions between the nodes in their models.

There is plenty of data that can be put to good use here. Facebook’s “Data for Good” resource has a wealth of data on population densities, social mobility and travel patterns, social connectedness, and others. Google’s COVID-19 Community Mobility Reports draw insights from Google Maps and other products into a data set that charts movement trends over time by geography across different categories of places, such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential areas. Similarly, Apple’s and Amazon’s mobility data serve a similar purpose with the goal of aiding efforts to limit the spread of COVID-19.

Spread of Information

We can use graphs to model the spread of information, disease, rumors, gossip, computer viruses, innovative ideas, or others. Such a model is usually a directed graph, where each node corresponds to an individual, and the edges are tagged with information about the interaction between individuals. The edge tags, or weights, are usually probabilities. The weight w_ij of the edge connecting node_i to node_j is the probability of a certain effect (disease, rumor, computer virus) propagating from node_i to node_j.
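A minimal simulation sketch of such a model: the graph, its weights, and the node names below are invented for illustration. Each newly activated node gets one chance to transmit the effect across each outgoing edge with that edge’s probability (an independent cascade-style process):

```python
import random

# Sketch of an independent cascade on a weighted directed graph: weights[i][j]
# is the probability that node i transmits the effect (disease, rumor,
# computer virus) to node j the first time i becomes "active."
def simulate_spread(weights, seeds, rng):
    active, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for neighbor, p in weights.get(node, {}).items():
            if neighbor not in active and rng.random() < p:
                active.add(neighbor)
                frontier.append(neighbor)
    return active

weights = {
    "a": {"b": 0.9, "c": 0.2},
    "b": {"d": 0.8},
    "c": {"d": 0.1},
}
rng = random.Random(0)
# Average outbreak size over many simulated cascades starting from "a"
sizes = [len(simulate_spread(weights, {"a"}, rng)) for _ in range(1000)]
print(sum(sizes) / len(sizes))
```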

Detecting and Tracking Fake News Propagation

Graph neural networks perform better in the task of detecting fake news (see Figure 9-5) than content-based natural language processing approaches. The abstract of the paper “Fake News Detection on Social Media using Geometric Deep Learning” (Monti et al. 2019) is informative:

Social media are nowadays one of the main news sources for millions of people around the globe due to their low cost, easy access, and rapid dissemination. This however comes at the cost of dubious trustworthiness and significant risk of exposure to ‘fake news’, intentionally written to mislead the readers. Automatically detecting fake news poses challenges that defy existing content-based analysis approaches. One of the main reasons is that often the interpretation of the news requires the knowledge of political or social context or ‘common sense’, which current natural language processing algorithms are still missing. Recent studies have empirically shown that fake and real news spread differently on social media, forming propagation patterns that could be harnessed for the automatic fake news detection. Propagation based approaches have multiple advantages compared to their content based counterparts, among which is language independence and better resilience to adversarial attacks. In this paper, we show a novel automatic fake news detection model based on geometric deep learning. The underlying core algorithms are a generalization of classical convolutional neural networks to graphs, allowing the fusion of heterogeneous data such as content, user profile and activity, social graph, and news propagation. Our model was trained and tested on news stories, verified by professional fact checking organizations, that were spread on Twitter. Our experiments indicate that social network structure and propagation are important features allowing highly accurate (92.7% ROC AUC) fake news detection. Second, we observe that fake news can be reliably detected at an early stage, after just a few hours of propagation. Third, we test the aging of our model on training and testing data separated in time. Our results point to the promise of propagation based approaches for fake news detection as an alternative or complementary strategy to content based approaches.

Figure 9-5. Nodes spreading fake news are labeled in red; like-minded people cluster together in the social network (see the image source for a color version of this image)

Web-Scale Recommendation Systems

Since 2018, Pinterest has been using the PinSage graph convolutional network. This curates users’ home feed and makes suggestions for new and relevant pins. The authors utilize random walks on graphs in their model, which we will discuss later in this chapter. Here is the full abstract:

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. However, making these methods practical and scalable to web-scale recommendation tasks with billions of items and hundreds of millions of users remains a challenge. Here we describe a large-scale deep recommendation engine that we developed and deployed at Pinterest. We develop a data efficient Graph Convolutional Network (GCN) algorithm PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure as well as node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy that relies on harder-and-harder training examples to improve robustness and convergence of the model. We deploy PinSage at Pinterest and train it on 7.5 billion examples on a graph with 3 billion nodes representing pins and boards, and 18 billion edges. According to offline metrics, user studies and A/B tests, PinSage generates higher-quality recommendations than comparable deep learning and graph-based alternatives. To our knowledge, this is the largest application of deep graph embeddings to date and paves the way for a new generation of web-scale recommender systems based on graph convolutional architectures.

Fighting Cancer

In the article “HyperFoods: Machine Intelligent Mapping of Cancer-Beating Molecules in Foods” (Veselkov et al. 2019), the authors use protein, gene, and drug interaction data to identify the molecules that help prevent and beat cancer. They also map the foods that are the richest in cancer-beating molecules (see Figure 9-6). Again, the authors utilize random walks on graphs. Here’s the abstract of the paper:

Recent data indicate that up to 30–40% of cancers can be prevented by dietary and lifestyle measures alone. Herein, we introduce a unique network-based machine learning platform to identify putative food-based cancer-beating molecules. These have been identified through their molecular biological network commonality with clinically approved anti-cancer therapies. A machine-learning algorithm of random walks on graphs (operating within the supercomputing DreamLab platform) was used to simulate drug actions on human interactome networks to obtain genome-wide activity profiles of 1962 approved drugs (199 of which were classified as “anti-cancer” with their primary indications). A supervised approach was employed to predict cancer-beating molecules using these ‘learned’ interactome activity profiles. The validated model performance predicted anti-cancer therapeutics with classification accuracy of 84–90%. A comprehensive database of 7962 bioactive molecules within foods was fed into the model, which predicted 110 cancer-beating molecules (defined by anti-cancer drug likeness threshold of >70%) with expected capacity comparable to clinically approved anti-cancer drugs from a variety of chemical classes including flavonoids, terpenoids, and polyphenols. This in turn was used to construct a ‘food map’ with anti-cancer potential of each ingredient defined by the number of cancer-beating molecules found therein. Our analysis underpins the design of next-generation cancer preventative and therapeutic nutrition strategies.

Figure 9-6. Machine intelligent mapping of cancer-beating molecules in foods; the larger the node, the more diverse its cancer-beating molecules (image source)

Biochemical Graphs

We can represent molecules and chemical compounds as graphs where the nodes are the atoms, and the edges are the chemical bonds between them. Data sets from this chemoinformatics domain are useful for assessing a classification model’s performance. For example, the NCI1 data set, containing around 4,100 chemical compounds, is useful for anti-cancer screens, where the chemicals are labeled as positive or negative for their ability to hinder the growth of non-small cell lung cancer. Similar labeled graph data sets for proteins and other compounds are available on the same website, along with the papers that employ them and the performance of different models on these data sets.
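A toy example of this representation in plain Python (the molecule, acetic acid, and its labeling are just illustrative; hydrogens are left implicit, as is common in such data sets):

```python
from collections import Counter

# Sketch: a chemical compound as a labeled graph. Nodes are atoms (labeled
# with the element), edges are the bonds between them (labeled with the bond
# order). Here: acetic acid, CH3-COOH, heavy atoms only.
atoms = {0: "C", 1: "C", 2: "O", 3: "O"}   # node labels
bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1}  # edge labels: bond order

# A simple graph-level feature a classifier might use: element counts
element_counts = Counter(atoms.values())
print(element_counts)  # Counter({'C': 2, 'O': 2})

# Degree of each atom in the heavy-atom skeleton
degree = {a: sum(a in b for b in bonds) for a in atoms}
print(degree)  # {0: 1, 1: 3, 2: 1, 3: 1}
```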

Molecular Graph Generation for Drug and Protein Structure Discovery

In the last chapter we learned how generative networks such as variational autoencoders and adversarial networks learn joint probability distributions from the data in order to generate similar-looking data for various purposes. Generative networks for graphs build on similar ideas; however, they are a bit more involved than networks generating images per se. Generative graph networks generate new graphs either in a sequential manner, outputting nodes and edges step-by-step, or in a global manner, outputting a whole graph’s adjacency matrix at once. See, for example, this survey paper on generative graph networks (2020) for details on the topic.

Citation Networks

In citation networks, the nodes could be the authors, and the edges are their coauthorships; or the nodes are papers, and the (directed) edges are the citations between them. Each paper has directed edges pointing to the papers it cites. Features for each paper include its abstract, authors, year, venue, title, field of study, and others. Tasks include node clustering, node classification, and link prediction. Popular data sets for paper citation networks include CoRA, CiteSeerX, and PubMed. The CoRA data set contains around three thousand machine learning publications grouped into seven categories. Each paper in the citation networks is represented by a one-hot vector indicating the presence or absence of a word from a prespecified dictionary, or by a term frequency-inverse document frequency (TF-IDF) vector. These data sets are updated continuously as more papers join the networks.
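A quick sketch of the one-hot (presence/absence) representation for a single paper, with a made-up five-word dictionary:

```python
# Sketch: the one-hot bag-of-words vector for one paper, relative to a small
# prespecified dictionary (the dictionary and abstract are invented).
dictionary = ["graph", "neural", "network", "kernel", "bayesian"]
abstract = "a graph neural network for citation networks"
words = set(abstract.split())
one_hot = [1 if term in words else 0 for term in dictionary]
print(one_hot)  # [1, 1, 1, 0, 0]
```

A TF-IDF vector would replace each 1 with a weight that grows with the term’s frequency in this paper and shrinks with its frequency across the corpus.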

Social Media Networks and Social Influence Prediction

Social media networks, such as Facebook, Twitter, Instagram, and Reddit, are a distinctive feature of our time (after 2010). One example of the available data sets is the Reddit data set: a graph where the nodes are posts, and an edge connects two posts that have comments from the same user. The posts are also labeled with the community to which they belong.

Social media networks and their social influence have a substantial impact on our societies, ranging from advertising to winning presidential elections to toppling political regimes. One important task for a graph model representing social networks is to predict the social influence of the nodes in the network. Here, the nodes are the users, and their interactions are the edges. Features include users’ age, gender, location, activity level, and others. One way to quantify social influence, the target variable, is through predicting the actions of a user given the actions of their near neighbors in the network. For example, if a user’s friends buy a product, what is the probability that they will buy the same product after a given period of time? Random walks on graphs help predict the social influence of certain nodes in a network.

Sociological Structures

Social diagrams are directed graphs that represent relationships among individuals in a society or among groups of individuals. The nodes are the members of the society or the groups, and the directed edges are the relationships between these members, such as admiration, association, influence, and others. We are interested in the connectedness, separability, size of fragments, and so forth in these social diagrams. One example is from anthropological studies where a number of tribes are classified according to their kinship structures.

Bayesian Networks

Later in this chapter we will discuss Bayesian networks. These are probabilistic graph models whose goal is one we are very familiar with in the AI field: to learn joint probability distribution of the features of a data set. Bayesian networks consider this joint probability distribution as a product of single variable distributions conditional only on a node’s parents in a graph representing the relationships between the features of the data. That is, the nodes are the feature variables and the edges are between the features that we believe to be connected. Applications include spam filtering, voice recognition, and coding and decoding, to name a few.
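A tiny sketch for the spam-filtering example, with an assumed two-feature structure (Spam → ContainsLink, Spam → ManyCapitals) and made-up probability tables. The joint distribution factors as P(s, l, c) = P(s)P(l|s)P(c|s):

```python
# Sketch: a tiny Bayesian network for spam filtering. The structure and all
# probabilities below are invented for illustration.
p_spam = {True: 0.3, False: 0.7}
p_link_given_spam = {True: {True: 0.6, False: 0.4},
                     False: {True: 0.1, False: 0.9}}
p_caps_given_spam = {True: {True: 0.5, False: 0.5},
                     False: {True: 0.05, False: 0.95}}

def joint(s, l, c):
    # The joint distribution factors over the graph: each variable is
    # conditioned only on its parent (here, Spam).
    return p_spam[s] * p_link_given_spam[s][l] * p_caps_given_spam[s][c]

# Posterior P(spam | contains link, many capitals) by Bayes' rule
evidence = joint(True, True, True) + joint(False, True, True)
posterior = joint(True, True, True) / evidence
print(round(posterior, 4))  # 0.9626
```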

Traffic Forecasting

Traffic prediction is the task of predicting traffic volumes using historical roadmaps, road speed, and traffic volume data. There are benchmark traffic data sets that we can use to track progress and compare models. For example, the METR-LA data set is a spatial-temporal graph, containing four months of traffic data collected by 207 sensors on the highways of Los Angeles County. The traffic network is a graph, where the nodes are the sensors, and the edges are the road segments between these sensors. At a certain time t, the features are traffic parameters, such as velocity and volume. The task of a graph neural network is to predict the features of the graph after a certain time has elapsed.

Other traffic forecasting models employ Bayesian networks, modeling, for example, the traffic flows among adjacent road links, where information from the adjacent links is used to analyze traffic trends on the link of interest.

Logistics and Operations Research

We can model and solve many problems in operations research, such as transportation problems and activity networks, using graphs. The graphs involved are usually weighted directed graphs. Operations research problems are combinatorial in nature, and are always trivial if the network is small. However, for large real-world networks, the challenge is finding efficient algorithms that can sift through the enormous search space and quickly rule out big parts of it. A large part of the research literature deals with estimating the computational complexity of such algorithms. This is called combinatorial optimization. Typical problems include the traveling salesman problem, supply chain optimization, shared rides routing and fares, job matching, and others. Some of the graph methods and algorithms used for such problems are minimal spanning trees, shortest paths, max-flow min-cuts, and matching in graphs. We will visit operations research examples later in this book.

Language Models

Graph models are relevant for a variety of natural language tasks. These tasks seem different on the surface, but many of them boil down to clustering, for which graph models are very well suited.

For any application we must first choose what the nodes, edges, and features for each represent. For natural language, these choices reveal hidden structures and regularities in the language and in the language corpora.

Instead of representing a natural language sentence as a sequence of tokens for recurrent models or as a vector of tokens for transformers, in graph models we embed sentences in a graph, then employ graph deep learning (or graph neural networks).

One example from computational linguistics is constructing diagrams for parsing language, as shown in Figure 9-7.

Figure 9-7. A parsed sentence

The nodes are words, n-grams, or phrases, and the edges are the relationships between them, which depend on the language grammar or syntax (article, noun, verb, etc.). A language is defined as the set of all strings correctly generated from the language vocabulary according to its grammar rules. In that sense, computer languages are easy to parse (they are built that way), while natural languages are much harder to specify completely due to their complex nature.

Parsing

Parsing means converting a stream of input into a structured or formal representation so it can be automatically processed. The input to a parser might be sentences, words, or even characters. The output is a tree diagram containing information about the function of each part of the input. Our brain is a great parser for language inputs. Computers parse programming languages.
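A hand-built example of a parser’s output for the sentence “the cat sat” (the tree and its simplified constituency labels are illustrative, not produced by an actual parser):

```python
# Sketch: a parse tree as nested tuples, hand-built for "the cat sat".
# Labels follow a simplified constituency grammar: S = sentence,
# NP = noun phrase, VP = verb phrase, DET = article, N = noun, V = verb.
parse_tree = ("S",
              ("NP", ("DET", "the"), ("N", "cat")),
              ("VP", ("V", "sat")))

def leaves(tree):
    # a node is a leaf when its payload is a bare string (a word)
    if isinstance(tree[1], str):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]

print(leaves(parse_tree))  # ['the', 'cat', 'sat']
```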

Another example is news clustering or article recommendations. Here, we use graph embeddings of text data to determine text similarity. The nodes can be words and the edges can be semantic relationships between the words, or just their co-occurrences. Or the nodes can be words and documents, and the edges can again be semantic or co-occurrence relationships. Features for nodes and edges can include authors, topics, time periods, and others. Clusters emerge naturally in such graphs.
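A sketch of the co-occurrence variant, with a made-up three-sentence corpus; edge weights count how often two words appear in the same sentence:

```python
from collections import defaultdict
from itertools import combinations

# Sketch: a word co-occurrence graph. Nodes are words; an edge's weight
# counts how many sentences contain both words. Clusters of related words
# emerge as densely connected regions of the graph.
def cooccurrence_graph(sentences):
    weights = defaultdict(int)
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights

corpus = [
    "stocks fell sharply",
    "stocks rose sharply",
    "the team won the match",
]
graph = cooccurrence_graph(corpus)
print(graph[("sharply", "stocks")])  # 2: co-occur in two sentences
```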

Another type of parsing that does not depend on the syntax or grammar of a language is abstract meaning representation (AMR). It relies instead on semantic representation, in the sense that sentences that are similar in meaning should be assigned the same abstract meaning representation, even if they are not identically worded. Abstract meaning representation graphs are rooted, labeled, directed, acyclic graphs, representing full sentences. These are useful for machine translation and natural language understanding. There are packages and libraries for abstract meaning representation parsing, visualization, and surface generation, as well as publicly available data sets.

For other natural language applications, the following survey paper is a nice reference that is easy to read to learn more about the subject: “A Survey of Graphs in Natural Language Processing” (Nastase et al. 2015).

Graph Structure of the Web

Since the inception of the World Wide Web in 1989, it has grown enormously and has become an indispensable tool for billions of people around the world. It allows access to billions of web pages, documents, and other resources using an internet web browser. With billions of pages linking to each other, it is of great interest to investigate the graph structure of the web. Mathematically, this vast and expansive graph is fascinating in its own right. But understanding this graph is important for more reasons than a beautiful mental exercise, providing insights into algorithms for crawling, indexing, and ranking the web (as in the PageRank algorithm that we saw earlier in this chapter), searching for communities, and discovering the social and other phenomena that characterize its growth or decay.

The World Wide Web graph has:

Nodes

Web pages, on the scale of billions

Edges

These are directed from one page linking to another page, on the scale of hundreds of billions

We are interested in:

  • What is the average degree of the nodes?

  • Degree distributions of the nodes (for both the in degree and out degree, which can be very different). Are they power laws? Some other laws?

  • Connectivity of the graph: what is the percentage of connected pairs?

  • Average distances between the nodes.

  • Is the observed structure of the web dependent or independent of the particular crawl used?

  • Particular structures of weakly and strongly connected components.

  • Is there a giant strongly connected component? What is the proportion of nodes that can reach or be reached from this giant component?
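These questions can be explored concretely on a toy graph. The sketch below, on an invented six-page link graph, computes the average degree and the set of pages reachable from a given page; a real web crawl would, of course, involve billions of nodes:

```python
from collections import defaultdict, deque

# Adjacency list of a toy directed web graph (invented for illustration)
links = {
    "A": ["B"], "B": ["C"], "C": ["A", "D"],
    "D": ["E"], "E": [], "F": ["A"],
}

out_degree = {page: len(targets) for page, targets in links.items()}
in_degree = defaultdict(int)
for targets in links.values():
    for t in targets:
        in_degree[t] += 1

num_edges = sum(out_degree.values())
avg_degree = num_edges / len(links)   # average out-degree = |E| / |V|

def reachable(start):
    """Pages reachable from `start` by following links (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for t in links.get(page, []):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

# A, B, C form a strongly connected set because A -> B -> C -> A,
# so each of the three reaches the other two.
print(avg_degree)              # 1.0
print(sorted(reachable("A")))  # ['A', 'B', 'C', 'D', 'E']
```

Degree distributions and giant-component proportions on a real crawl follow the same logic, only at scale and typically with distributed computation.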

Automatically Analyzing Computer Programs

We can use graphs for verification of computer programs, program reasoning, reliability theory, fault diagnosis in computers, and studying the structure of computer memory. The paper “Graph Neural Networks on Program Analysis” (Allamanis 2021) is one example:

Program analysis aims to determine if a program’s behavior complies with some specification. Commonly, program analyses need to be defined and tuned by humans. This is a costly process. Recently, machine learning methods have shown promise for probabilistically realizing a wide range of program analyses. Given the structured nature of programs, and the commonality of graph representations in program analysis, graph neural networks (GNN) offer an elegant way to represent, learn, and reason about programs and are commonly used in machine learning-based program analyses. This article discusses the use of graph neural networks for program analysis, highlighting two practical use cases: Variable misuse detection and type inference.

Data Structures in Computer Science

A data structure in computer science is a structure that stores, manages, and organizes data. There are different data structures, and they are usually chosen in a way that makes it efficient to access the data (read, write, append, infer or store relationships, etc.).

Some data structures use graphs to organize data, computational devices in a cluster, and represent the flow of data and computation or the communication network. There are also graph databases geared toward storing and querying graph data. Other databases transform graph data to more structured formats (such as relational formats).

Here are some examples of graph data structures:

PageRank algorithm

We have already encountered the PageRank algorithm, along with the link structure of a website represented as a directed graph, where the nodes are the web pages and the edges represent the links from one page to another. A database keeping all the web pages along with their link structures can be either graph structured, where the graph is stored as is using the linking matrix or adjacency matrix with no transformation necessary, or it can be transformed to fit the structure of other nongraphical databases.

Binary search trees for organizing files in a database

Binary search trees are ordered data structures that are efficient for both random and sequential access of records, and for file modification. The inherent order of a binary search tree speeds up search time: we cut the amount of data to sort through by half at each level of the tree. It also speeds up insertion time: unlike an array, when we add a node to the binary tree data structure, we create a new piece in memory and link to it. This is faster than creating a new large array, then inserting the data from the smaller array to the new, larger one.
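A minimal sketch of the insertion and search logic described above (an illustrative toy, not a production index structure): inserting links a new node into memory without copying existing records, and searching discards half of the remaining keys at each level.

```python
# A minimal binary search tree sketch (keys invented for illustration)
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert by linking a new node, without copying existing data."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Walk down one branch, halving the search space at each level."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)

print(search(root, 60))  # True
print(search(root, 65))  # False
```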

Graph-based information retrieval systems

In some information retrieval systems we assign a certain number of index terms to each document. We can think of these as the document’s indicators, descriptors, or keywords. These index terms will be represented as nodes of a graph. We connect two index terms with an undirected edge if the two happen to be closely related, such as the index terms graph and network. The resulting similarity graph is very large, and is possibly disconnected. The maximal connected subgraphs of this graph are its components, and they naturally classify the documents in this system. For information retrieval, our query specifies some index terms, that is, certain nodes of the graph, and the system returns the maximal complete subgraph that includes the corresponding nodes. This gives the complete list of index terms, which in turn specify the documents we are searching for.
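The component-finding step can be sketched as follows, on an invented set of index terms and similarity edges:

```python
# Undirected similarity edges between index terms (invented examples)
edges = [
    ("graph", "network"), ("network", "node"),
    ("probability", "statistics"),
    ("calculus", "derivative"),
]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def components(adj):
    """Maximal connected subgraphs, found by repeated traversal."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term in comp:
                continue
            comp.add(term)
            stack.extend(adj[term] - comp)
        seen |= comp
        comps.append(comp)
    return comps

for comp in components(adjacency):
    print(sorted(comp))
# ['graph', 'network', 'node']
# ['probability', 'statistics']
# ['calculus', 'derivative']
```

Each printed component groups the documents indexed by its terms into one natural class.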

Load Balancing in Distributed Networks

The computational world has grown from Moore’s law to parallel computing to cloud computing. In cloud computing, our data, our files, and the machines executing our files and doing computations on our data are not near us. They are not even near each other. As applications become more complex and as network traffic increases, we need the software or hardware analog of a network traffic cop that distributes network traffic across multiple servers so that no single server bears a heavy load, enhancing performance in terms of application response times, end user experience, and so on.

As traffic increases, more appliances, or nodes, need to be added to handle the volume. Network traffic distribution needs to be done while preserving data security and privacy, and should be able to predict traffic bottlenecks before they happen. This is exactly what load balancers do. It is not hard to imagine the distributed network as a graph with the nodes as the connected servers and appliances. Load balancing is then a traffic flow problem on a given graph, and there are a variety of algorithms for allocating the load. All algorithms operate on the network’s graph. Some are static, allocating loads without updating the network’s current state in terms of loads or malfunctioning units, while others are dynamic, but require constant communication within the network about the nodes’ statuses. The following are some algorithms:

Least connection algorithm

This method directs traffic to the server with the fewest active connections.

Least response time algorithm

This method directs traffic to the server with the fewest active connections and the lowest average response time.

Round-robin algorithm

This algorithm allocates load in a rotation on the servers. It directs traffic to the first available server, then moves that server to the bottom of the queue.

IP hash

This method allocates servers based on the IP address of the client.
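Two of these policies are simple enough to sketch directly; the server names and connection counts below are invented for illustration:

```python
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]

# Round-robin: rotate through the servers in order
rotation = cycle(servers)
round_robin_picks = [next(rotation) for _ in range(5)]

# Least connection: direct traffic to the server with the
# fewest active connections
active_connections = {"server-1": 12, "server-2": 3, "server-3": 7}

def least_connection(conns):
    return min(conns, key=conns.get)

print(round_robin_picks)
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2']
print(least_connection(active_connections))  # server-2
```

The least response time and IP hash policies follow the same pattern, with the selection key replaced by measured latency or a hash of the client address.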

Artificial Neural Networks

Finally, artificial neural networks are graphs where the nodes are the computational units and the edges are inputs and outputs of these units. Figure 9-8 summarizes popular artificial neural network models as graphs.

Figure 9-8. Neural networks as graphs

Random Walks on Graphs

A random walk on a graph (Figure 9-9) means exactly what it says: a sequence of steps that starts at some node, and at each time step, chooses a neighboring node (using the adjacency matrix) with probability proportional to the weights of the edges, and moves there.

Figure 9-9. A random walk on an undirected graph

If the edges are unweighted, then the neighboring nodes are all equally likely to be chosen for the walk’s move. At any time step, the walk can stay at the same node, either when there is a self edge, or when the walk is a lazy random walk, in which case it stays at a node with positive probability instead of moving to one of its neighbors. We are interested in the following:

  • What is the list of nodes visited by a random walk, in the order they are visited? Here, the starting point and the structure of the graph matter in how much a walk covers or whether the walk can ever reach certain regions of the graph. In graph neural networks, we are interested in learning a representation for a given node based on its neighbors’ features. In large graphs where nodes have more neighbors than is computationally feasible, we employ random walks. However, we have to be careful since different parts of the graph have different random walk expansion speeds, and if we do not take that into account by adjusting the number of steps of a random walk according to the subgraph structure, we might end up with low-quality representations for the nodes, and undesirable outcomes as these go down the work pipeline.

  • What is the expected behavior of a random walk, that is, the probability distribution over the visited nodes after a certain number of steps? We can study basic properties of a random walk, such as its long-term behavior, by using the spectrum of the graph, which is the set of eigenvalues of its adjacency matrix. In general, the spectrum of an operator helps us understand what happens when we repeatedly apply the operator. Randomly walking on a graph is equivalent to repeatedly applying a normalized version of the adjacency matrix to the node of the graph where we started. Every time we apply this random walk matrix, we walk one step further on the graph.

  • How does a random walk behave on different types of graphs, such as paths, trees, two fully connected graphs joined by one edge, infinite graphs, and others?

  • For a given graph, does the walk ever return to its starting point? If so, how long do we have to walk until we return?

  • How long do we have to walk until we reach a specific node?

  • How long do we have to walk until we visit all the nodes?

  • How does a random walk expand? That is, what is the influence distribution of certain nodes that belong in certain regions of the graph? What is the size of their influence?

  • Can we design algorithms based on random walks that are able to reach obscure parts of large graphs?
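The basic mechanics of such a walk can be sketched on a small invented weighted graph, where each step picks a neighbor with probability proportional to the connecting edge weights:

```python
import random

# weights[u][v] is the weight of the undirected edge between u and v
# (a toy three-node graph, invented for illustration)
weights = {
    "A": {"B": 3.0, "C": 1.0},
    "B": {"A": 3.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0},
}

def random_walk(start, num_steps, seed=0):
    """Walk num_steps steps, choosing neighbors proportionally to weights."""
    rng = random.Random(seed)
    path, node = [start], start
    for _ in range(num_steps):
        neighbors = list(weights[node])
        w = [weights[node][n] for n in neighbors]
        node = rng.choices(neighbors, weights=w, k=1)[0]
        path.append(node)
    return path

path = random_walk("A", 10)
print(path)  # a length-11 list of visited nodes, starting at 'A'
```

From node A, the walk moves to B three times as often as to C, since the A-B edge carries three times the weight.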

Random Walks and Brownian Motion

In the limit where the size of steps of a random walk goes to zero, we obtain a Brownian motion. Brownian motion usually models the random fluctuations of particles suspended in a medium such as a fluid, or price fluctuations of derivatives in financial markets. We frequently encounter the term Brownian motion with the term Wiener process, which is a continuous stochastic process with a clear mathematical definition of how the motion (of a particle or of a price fluctuation in finance) starts (at zero), how the next step is sampled (from the normal distribution and with independent increments), and assumptions about its continuity as a function of time (it is almost surely continuous). Another term it is associated with is martingale. We will see these in Chapter 11 on probability.

We did encounter a random walk once in this chapter when discussing the PageRank algorithm, where a random web page surfer randomly chooses to move from the page they are at to a neighboring page on the web. We noticed that the long-term behavior of the walk is discovered when we repeatedly apply the linking matrix of the graph, which is the same as the adjacency matrix normalized by the degrees of each node. In the next section we see more uses of random walks for graph neural networks.
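This repeated application of the degree-normalized adjacency matrix can be sketched numerically on a small invented graph; the distribution of the walk converges to the stationary distribution:

```python
import numpy as np

# Adjacency matrix of an undirected triangle (invented for illustration)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Normalize each row by the node degree: P[i, j] = A[i, j] / degree(i)
P = A / A.sum(axis=1, keepdims=True)

# Start the walk at node 0 with certainty
p = np.array([1.0, 0.0, 0.0])

# Each application of P advances the walk one step
for _ in range(50):
    p = p @ P

print(np.round(p, 6))  # approaches the stationary distribution [1/3 1/3 1/3]
```

For this regular graph, the stationary distribution is uniform; in general it is proportional to the node degrees, which is exactly the long-term behavior the PageRank surfer exhibits.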

We can use random walks (on directed or undirected graphs, weighted or unweighted graphs) for community detection and influence maximization in small networks, where we would only need the graph’s adjacency matrix (as opposed to node embedding into feature vectors, then clustering).

Node Representation Learning

Before implementing any graph tasks on a machine, we must be able to represent the nodes of the graph as vectors that contain information about their position in the graph and their features relative to their locality within the graph. A node’s representation vector is usually aggregated from the node’s own features and the features of the nodes surrounding it.

There are different ways to aggregate features, transform them, or even choose which of the neighboring nodes contribute to the feature representation of a given node. Let’s go over a few methods:

  • Traditional node representation methods rely on subgraph summary statistics.

  • In other methods, nodes that occur together on short random walks will have similar vector representations.

  • Other methods take into account that random walks tend to spread differently on different graph substructures, so a node’s representation method adapts to the local substructure that the node belongs in, deciding on an appropriate radius of influence for each node depending on the topology of the subgraph it belongs to.

  • Yet other methods produce a multiscale representation by multiplying the feature vector of a node with powers of a random walk matrix.

  • Other methods allow for nonlinear aggregation of the feature vectors of a node and its neighbors.

It is also important to determine how large of a neighborhood a node draws information from (influence distribution), that is, to find the range of nodes whose features affect a given node’s representation. This is analogous to sensitivity analysis in statistics, but here we need to determine the sensitivity of a node’s representation to changes in the features of the nodes surrounding it.
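The simplest aggregation scheme listed above, averaging a node's own feature vector with those of its neighbors, can be sketched as follows; the features and the small social graph are invented for illustration:

```python
import numpy as np

# Feature vectors for three users in a toy network
features = {
    "u1": np.array([1.0, 0.0]),
    "u2": np.array([0.0, 1.0]),
    "u3": np.array([1.0, 1.0]),
}
neighbors = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}

def mean_aggregate(node):
    """Representation = mean of the node's own and neighbors' features."""
    vectors = [features[node]] + [features[n] for n in neighbors[node]]
    return np.mean(vectors, axis=0)

print(mean_aggregate("u1"))  # [0.66666667 0.66666667]
```

Real methods layer learned transformations and nonlinearities on top of this aggregation, and may repeat it over several hops, but the underlying operation is the same.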

After creating a node representation vector, we feed it into another machine learning model during training, such as a support vector machine model for classification, just like feeding other features of the data into the model. For example, we can learn the feature vector of every user in a social network, then pass these vectors into a classification model along with other features to predict if the user is a fake news spreader or not. However, we do not have to rely on a machine learning model downstream to classify nodes. We can do that directly from the graph structural data where we can predict a node’s class depending on its association with other local nodes. The graph can be only partially labeled, and the task is to predict the rest of the labels. Moreover, the node representation step can either be a preprocessing step, or one part of an end-to-end model, such as a graph neural network.

Tasks for Graph Neural Networks

After going through the linear algebra formulation of graphs, applications of graph models, random walks on graphs, and vector node representations that encode node features along with their zones of influence within the graph, we should have a good idea about the kind of tasks that graph neural networks can perform. Let’s go through some of these.

Node Classification

The following are examples of node classification tasks:

  • In an articles citation network, such as in CiteSeerX or CoRA, where the nodes are academic papers (given as bag-of-words vectors) and the directed edges are the citations between the papers, classify each paper into a specific discipline.

  • In the Reddit data set, where the nodes are comment posts (given as word vectors) and undirected edges are between comments posted by the same user, classify each post according to the community it belongs to.

  • The protein-protein interaction network data set contains 24 graphs where the nodes are labeled with gene ontology sets (do not worry about the technical biomedical names; focus on the math instead. This is the nice thing about mathematical modeling: it works the same way for all kinds of applications from all kinds of fields, which validates it as a potential underlying language of the universe). Usually 20 graphs from the protein-protein interaction network data set are used for training, 2 graphs are used for validation, and the rest for testing, each corresponding to a human tissue. The features associated with the nodes are positional gene sets, motif gene sets, and immunological signatures. Classify each node according to its gene ontology set.

  • There are wildlife trade monitoring networks such as Traffic.org, analyzing dynamic wildlife trade trends and using and updating data sets such as the CITES Wildlife Trade Database or the USDA Ag Data Commons (this data set includes more than a million wildlife or wildlife product shipments, representing more than 60 biological classes and more than 3.2 billion live organisms). One classification task on the trade network is to classify each node, representing a trader (a buyer or a seller) as being engaged in illegal trade activity or not. The edges in the network represent a trade transaction between the buyer and seller. Features for the nodes include personal information for the traders, bank account numbers, locations, etc.; and features for the edges would include transaction identification numbers, dates, price tags, the species bought or sold, and so on. If we already have a subset of the traders labeled as illegal traders, our model’s task would then be to predict the labels of other nodes in the network based on their connections with other nodes (and their features) in the network.

Node classification examples lend themselves naturally to semi-supervised learning, where only a few nodes in the data set come with their labels, and the task is to label the rest of the nodes. Clean labeled data is what we should all be advocating for, so that our systems are more accurate, reliable, and transparent.

Graph Classification

Sometimes we want to label a whole graph as opposed to labeling the individual nodes of a graph. For example, in the PROTEINS data set we have a collection of chemical compounds each represented as a graph and labeled as either an enzyme or not an enzyme. For a graph learning model we would input the nodes, edges, their features, the graph structure, and the label for each graph in the data set, thus creating a whole graph representation or embedding, as opposed to a single node representation.

Clustering and Community Detection

Clustering in graphs is an important task that discovers communities or groups in networks, such as terrorist organizations. One way is to create node and graph representations, then feed them into traditional clustering methods such as k-means clustering. Other ways produce node and graph representations that take into account the goal of clustering within their design. These can include encoder-decoder designs and attention mechanisms similar to methods we encountered in previous chapters. Other methods are spectral, which means that they rely on the eigenvalues of the Laplacian matrix of the graph. Note that for nongraph data, principal component analysis is one method that we used for clustering that is also spectral, relying on the singular values of the data table. Computing eigenvalues of anything is always an expensive operation, so the goal becomes finding ways around having to do it. For graphs, we can employ graph theoretic methods such as max-flow min-cut (we will see this later in this chapter). Different methods have their own sets of strengths and shortcomings; for example, some employ time-proven graph theoretic results but fail to include the node or edge features, because the theory was not developed with any features in mind, let alone a whole bunch of them. The message is to always be honest about what our models do and do not account for.
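The spectral idea can be sketched on a small invented graph: form the Laplacian L = D - A and use the sign pattern of its second eigenvector (the Fiedler vector) to split the graph into two communities.

```python
import numpy as np

# Two triangles {0,1,2} and {3,4,5}, joined by the single edge (2, 3)
# (an invented graph with an obvious two-community structure)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The eigenvector of the second-smallest eigenvalue (the Fiedler vector)
# splits the graph into its two natural communities by sign.
fiedler = eigenvectors[:, 1]
community = fiedler > 0
print(community)  # one triangle on each side of the split
```

The smallest eigenvalue of a Laplacian is always zero; the second-smallest (the algebraic connectivity) measures how hard the graph is to cut, which is the link to the max-flow min-cut perspective.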

Graph Generation

Graph generation is very important for drug discovery, materials design, and other applications. Traditionally, graph generation approaches used handcrafted families of random graph models using a simple stochastic generation process. These models are well understood mathematically due to their simple properties. However, for the same reason, they are limited in their ability to capture real-world graphs with more complex dependencies, or even the correct statistical properties, such as heavy tailed distribution for the node degrees that many real-world networks exhibit. More recent approaches, such as generative graph neural networks, integrate graph and node representations with generative models that we went over in the previous chapter. These have a greater capacity to learn structural information from the data and generate complex graphs, such as molecules and compounds.

Influence Maximization

Influence maximization is a subfield of network diffusion, where the goal is to maximize the diffusion of something, such as information or a vaccine, through a network, while only giving the thing to a few initial nodes, or seeds. The objective is to locate the few nodes that have an overall maximal influence. Applications include information propagation such as job openings, news, advertisements, and vaccinations. Traditional methods for locating the seeds choose the nodes based on highest degree, closeness, betweenness, and other graph structure properties. Others employ the field of discrete optimization, obtaining good results and proving the existence of approximate optimizers. More recent approaches employ graph neural networks and adversarial networks when there are other objectives competing with the objective of maximizing a node’s influence, for example, reaching a specific portion of the population, such as a certain minority group, that is not necessarily strongly connected with the natural hubs of the graph.

Link Prediction

Given two nodes of a graph, what is the probability that there is an edge linking them? Note that proximity in the sense of sharing common neighbors is not necessarily an indicator for a link (or an interaction). In a social network, people tend to run in the same circles, so two people sharing many common friends are likely to interact and are likely connected as well. But in some biological systems, such as in studying protein-protein interactions, the opposite is true: proteins sharing more common neighbors are less likely to interact. Therefore, computing similarity scores based on basic properties such as graph distance, degrees, common neighbors, and so on does not always produce correct results. We need neural networks to learn node and graph embeddings along with classifying whether two nodes are linked or not. One example of such networks is in the paper “Link Prediction with Graph Neural Networks and Knowledge Extraction” (Zhang et al. 2020).
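The basic similarity scores mentioned above (common neighbors and the Jaccard coefficient) can be sketched in a few lines on an invented neighborhood structure; as just discussed, such scores alone are often unreliable link predictors:

```python
# Neighbor sets of a toy undirected graph (invented for illustration)
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbors(u, v):
    """Number of neighbors shared by u and v."""
    return len(neighbors[u] & neighbors[v])

def jaccard(u, v):
    """Shared neighbors as a fraction of all neighbors of u or v."""
    union = neighbors[u] | neighbors[v]
    return len(neighbors[u] & neighbors[v]) / len(union) if union else 0.0

print(common_neighbors("b", "d"))  # 2
print(jaccard("b", "d"))           # 1.0
```

A high score here predicts a link in a social network but may predict the absence of an interaction in a protein network, which is why learned embeddings are preferred when the relationship between proximity and linking is not known in advance.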

Dynamic Graph Models

Many of the applications we have discussed in this chapter would benefit from including time dependency in our graph models, since they are dynamic in nature. Examples include traffic forecasting, load balancing for distributed networks, simulating all kinds of interacting particle systems, and illegal wildlife trade monitoring. In dynamic graph models, node and edge features are allowed to evolve with time, and some nodes or edges can be added or removed. This modeling captures information such as the latest trade trends in a market, fluctuations, new criminal activity in certain networks, and new routes or connections in transportation systems.

Thinking about how to model dynamic graphs and extract information from them is not new; see, for example, the article “Dynamic Graph Models” (Harary et al. 1997). However, the introduction of deep learning makes knowledge extraction from such systems more straightforward. Current approaches for dynamic graphs integrate graph convolutions to capture spatial dependencies with recurrent neural networks or convolutional neural networks to model temporal dependencies.

The paper “Learning to Simulate Complex Physics with Graph Networks” (Sanchez-Gonzalez et al. 2020) is a great example with wonderful high-resolution results that employs a dynamic graph neural network to simulate any system of interacting particles on a much larger scale, in terms of both the number of involved particles and the time the system is allowed to (numerically) evolve, than what was done before. The particles, such as sand or water particles, are the nodes of the graph, with attributes such as position, velocity, pressure, external forces, etc., and the edges connect the particles that are allowed to interact with each other. The input to the neural network is a graph, and the output is a graph with the same nodes and edges but with updated attributes of particle positions and properties. The network learns the dynamics, or the update rule at each time step, via message passing. The update rule depends on the system’s state at the current time step, and on a parameterized function whose parameters are optimized for some training objective that depends on the specific application, which is the main step in any neural network. The prediction target for supervised learning is the average acceleration of each particle.

Bayesian Networks

Bayesian networks are graphs that are perfectly equipped to deal with uncertainty, encoding probabilities in a mathematically sound way. Figures 9-10 and 9-11 are examples of two Bayesian networks.

In a Bayesian network:

  • The nodes are variables that we believe our model should include.

  • 边是有向的,从节点指向节点,或者从较高的神经元指向较低的神经元,从某种意义上说,我们知道子变量的概率取决于观察父变量。

  • The edges are directed, pointing from the parent node to the child node, or from a higher neuron to a lower neuron, in the sense that we know the probability of the child variable is conditional on observing the parent variable.

  • 网络图中不允许有循环。

  • No cycles are allowed in the graph of the network.

  • 严重依赖贝叶斯规则:如果有一个从A到B的箭头,那么P ( B | A )是前向概率,P ( A | B )是逆向概率。将此视为P(证据|假设)或P(症状|疾病)。我们可以根据贝叶斯法则计算逆概率:

    P(A|B) = P(B|A)P(A) / P(B)
  • Heavy reliance on Bayes’ Rule: if there is an arrow from A to B, then P(B|A) is the forward probability, and P(A|B) is the inverse probability. Think of this as P(evidence|hypothesis) or P(symptoms|disease). We can calculate the inverse probability from Bayes’ Rule:

    P(A|B) = P(B|A)P(A) / P(B)
  • 如果没有箭头指向某个变量(如果它没有父母),那么我们需要的只是该变量的先验概率,我们根据数据或专家知识计算该先验概率,例如美国 13% 的女性发展为乳腺癌。

  • If there is no arrow pointing to a variable (if it has no parents), then all we need is the prior probability of that variable, which we compute from the data or from expert knowledge, such as 13% of women in the US develop breast cancer.

  • 如果我们碰巧获得了模型中某个变量的更多数据,或者更多的证据,我们会更新与该变量对应的节点(条件概率),然后沿着网络中的连接传播该信息,更新条件概率每个节点,以两种不同的方式,具体取决于信息是从父节点传播到子节点,还是从子节点传播到父节点。各个方向的更新很简单:遵守贝叶斯法则。

  • If we happen to obtain more data on one of the variables in the model, or more evidence, we update the node corresponding to that variable (the conditional probability), then propagate that information following the connections in the network, updating the conditional probabilities at each node, in two different ways, depending on whether the information is propagating from parent to child, or from child to parent. The update in each direction is very simple: comply with Bayes’ Rule.
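To make the bullets above concrete, here is a minimal sketch of one belief update with Bayes’ Rule, using the 13% prior mentioned above; the test’s sensitivity and false-positive rate below are invented numbers for illustration only.

```python
# Inverse probability via Bayes' Rule: P(disease | positive test).
# The 13% prior comes from the text; the sensitivity and false-positive
# rate are hypothetical numbers, not real clinical figures.
p_cancer = 0.13                  # prior: P(A)
p_pos_given_cancer = 0.85        # forward probability P(B|A) (assumed)
p_pos_given_no_cancer = 0.10     # false-positive rate (assumed)

# Total probability of the evidence: P(B).
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

# Bayes' Rule: P(A|B) = P(B|A) P(A) / P(B).
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # 0.559
```

Even with a fairly accurate test, the posterior stays close to 50% because the evidence term P(B) is inflated by false positives from the much larger healthy population.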

图 9-10。贝叶斯网络
Figure 9-10. A Bayesian network
图 9-11。另一个贝叶斯网络
Figure 9-11. Another Bayesian network

贝叶斯网络表示紧凑的条件概率表

A Bayesian Network Represents a Compactified Conditional Probability Table

贝叶斯网络表示的是一个紧凑的条件概率表。通常,当我们对现实场景进行建模时,每个离散变量可以取某些离散值或类别,每个连续变量可以取给定连续范围内的任何值。理论上,我们可以构建一个巨大的条件概率表,在假设其他变量取给定固定值的情况下,给出每个变量的概率。实际上,即使变量数量相当少,这对于存储和计算来说也是不可行且昂贵的。此外,我们无法获得构建该表所需的全部信息。贝叶斯网络通过只允许变量与少数相邻变量交互来绕过这个障碍,因此我们只需在给定网络图中与某变量直接相连的那些变量的状态的情况下计算该变量的概率,无论是前向还是后向。如果关于网络中任何变量的新证据出现,那么图结构与贝叶斯法则一起指导我们以系统的、可解释的和透明的方式更新网络中所有变量的概率。以这种方式稀疏化网络是贝叶斯网络的一个特征,这使得它们取得了成功。

What a Bayesian network represents is a compactified conditional probability table. Usually when we model a real-world scenario, each discrete variable can assume certain discrete values or categories, and each continuous variable can assume any value in a given continuous range. In theory, we can construct a giant conditional probability table that gives the probability of each variable, assuming a certain state of given fixed values of the other variables. In reality, even for a reasonably small number of variables, this is infeasible and expensive both for storage and computations. Moreover, we do not have access to all the information required to construct the table. Bayesian networks get around this hurdle by allowing variables to interact with only a few neighboring variables, so we only have to compute the probability of a variable given the states of those variables directly connected to it in the graph of the network, albeit both forward and backward. If new evidence about any variable in the network arrives, then the graph structure, together with Bayes’ Rule, guides us to update the probabilities of all the variables in the network in a systematic, explainable, and transparent way. Sparsifying the network this way is a feature of Bayesian networks that has allowed for their success.

总之,贝叶斯网络的图将模型变量(或数据特征)的联合概率分布指定为局部条件概率分布的乘积,每个节点一个:

In summary, a Bayesian network’s graph specifies the joint probability distribution of the model’s variables (or the data’s features) as a product of local conditional probability distributions, one for each node:

P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))
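A minimal sketch of this factorization for a hypothetical three-node chain A → B → C, with made-up conditional probability tables; the joint probability of any assignment is just the product of one local factor per node.

```python
from itertools import product

# A tiny chain network A -> B -> C with hypothetical binary CPTs.
p_a = {True: 0.3, False: 0.7}                    # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},    # P(B|A): outer key is A
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.6, False: 0.4},    # P(C|B): outer key is B
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # The factorization the network encodes: one local factor per node.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Sanity check: the factored joint distribution sums to 1.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
print(round(total, 10))   # 1.0
```

Instead of storing all 2^3 joint entries, we store three small local tables; the savings grow dramatically as the number of variables increases.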

使用贝叶斯网络进行预测

Making Predictions Using a Bayesian Network

一旦贝叶斯网络设置完成并初始化了条件概率(并使用更多数据不断更新),我们只需根据查询,结合贝叶斯法则或乘积法则搜索这些条件概率分布表,即可快速获得结果。例如,考虑到该电子邮件包含的单词、发件人位置、一天中的时间、其中包含的链接、发件人和收件人之间的交互历史记录以及垃圾邮件检测变量的其他取值,该电子邮件是垃圾邮件的概率是多少?根据乳房X光检查结果、家族史、症状、血液检查等,患者患有乳腺癌的概率是多少?这里最好的部分是,我们不需要消耗能量来执行大型程序或使用大型计算机集群来获得结果。也就是说,我们的手机和平板电脑电池将持续更长时间,因为它们不必花费太多计算能力来编码和解码消息;相反,它们在 turbo 解码前向纠错算法中应用了贝叶斯网络。

Once a Bayesian network is set and the conditional probabilities initiated (and continuously updated using more data), we can get quick results simply by searching these conditional probability distribution tables, along with Bayes’ Rule or the product rule, depending on the query. For example, what is the probability that this email is spam given the words it contains, the sender location, the time of day, the links it includes, the history of interaction between the sender and the recipient, and other values of the spam detection variables? What is the probability that a patient has breast cancer given the mammogram test result, family history, symptoms, blood tests, etc.? The best part here is that we do not need to consume energy executing large programs or employ large clusters of computers to get our results. That is, our phone and tablet batteries will last longer because they don’t have to spend much computational power coding and decoding messages; instead, they apply Bayesian networks in the turbo decoding forward error correction algorithm.

贝叶斯网络是信念网络,而不是因果网络

Bayesian Networks Are Belief Networks, Not Causal Networks

在贝叶斯网络中,虽然从父变量指向子变量的箭头最好是因果关系,但一般来说它并不是因果的。它只意味着我们可以在给定父变量状态的情况下对子变量的概率分布进行建模,并且我们可以使用贝叶斯法则来找到逆概率:给定子变量时父变量的概率分布。这通常是更困难的方向,因为它不太直观并且更难观察。思考这个问题的一种方法是:即使在生孩子之前,已知父母的特征来计算孩子特征的概率分布 P(child|mother,father),也比已知孩子的特征来推断父母的特征 P(father|child) 和 P(mother|child) 更容易。本例中的贝叶斯网络(图 9-12)具有三个节点:母亲、父亲和孩子,其中一条边从母亲指向孩子,另一条边从父亲指向孩子。

In Bayesian networks, although a pointed arrow from parent variable to child variable is preferably causal, in general it is not causal. All it means is that we can model the probability distribution of a child variable given the states of its parent(s), and we can use Bayes’ Rule to find the inverse probability: the probability distribution of a parent given the child. This is usually the more difficult direction because it is less intuitive and harder to observe. One way to think about this is that it is easier to calculate the probability distribution of a child’s traits given that we know the parents’ traits, P(child|mother,father), even before having the child, than inferring the parents’ traits given that we know the child’s traits P(father|child) and P(mother|child). The Bayesian network in this example (Figure 9-12) has three nodes, mother, father, and child, with an edge pointing from the mother to the child, and another edge pointing from the father to the child.

图 9-12。子变量是贝叶斯网络中的碰撞器
Figure 9-12. The child variable is a collider in a Bayesian network

母亲和父亲之间没有边,因为他们的特征没有理由相关。了解母亲的特征并不能提供有关父亲特征的信息;然而,同时了解母亲的特征和孩子的特征可以让我们对父亲的特征或分布 P(father|mother,child) 多了解一点。这意味着,母亲和父亲的特征原本是独立的,但在知道孩子的特征后,就变得条件依赖了。因此,贝叶斯网络在图结构中对变量之间的依赖性进行建模,提供了变量之间如何相互关联的映射:它们的条件依赖性和条件独立性。这就是为什么贝叶斯网络也被称为信念网络。

There is no edge between the mother and the father because there is no reason for their traits to be related. Knowing the mother’s traits gives us no information about the father’s traits; however, knowing the mother’s traits and the child’s traits allows us to know slightly more about the father’s traits, or the distribution P(father|mother,child). This means that the mother’s and father’s traits, which were originally independent, are conditionally dependent given knowing the child’s traits. Thus, a Bayesian network models the dependencies between variables in a graph structure, providing a map for how the variables are believed to relate to each other: their conditional dependencies and independencies. This is why Bayesian networks are also called belief networks.
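The effect described above is easy to reproduce by simulation. The sketch below uses an invented linear model for the child’s trait; only the qualitative behavior matters: the parents’ traits are uncorrelated overall, but become (negatively) correlated once we condition on the child.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Mother's and father's "traits" are independent by construction.
mother = rng.normal(size=n)
father = rng.normal(size=n)
# The child's trait depends on both parents (plus noise): a collider.
child = mother + father + 0.5 * rng.normal(size=n)

# Unconditionally, the parents are (nearly) uncorrelated.
r_all = np.corrcoef(mother, father)[0, 1]

# Condition on the collider: look only at children with a high trait value.
mask = child > 1.5
r_given_child = np.corrcoef(mother[mask], father[mask])[0, 1]

print(round(r_all, 2))           # ~0.0
print(round(r_given_child, 2))   # clearly negative: a path has opened
```

Intuitively, among children with a high trait value, a low-trait mother makes a high-trait father more likely, and vice versa: this is the "explaining away" behavior of colliders.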

关于贝叶斯网络请记住这一点

Keep This in Mind About Bayesian Networks

关于贝叶斯网络,让我们记住以下几点:

Let’s keep the following in mind about Bayesian networks:

  • 贝叶斯网络没有因果方向,在回答因果问题或“为什么”问题方面受到限制,例如:是什么导致了某种疾病的发作?也就是说,我们很快就会知道我们可以使用贝叶斯网络进行因果推理,并预测干预的后果。无论是否用于因果推理,我们如何更新贝叶斯网络,或者我们如何传播信念总是以相同的方式工作。

  • Bayesian networks have no causal direction and are limited in answering causal questions, or “why” questions, such as: what caused the onset of a certain disease? That said, we will soon learn that we can use a Bayesian network for causal reasoning, and to predict the consequences of intervention. Whether used for causal reasoning or not, how we update a Bayesian network, or how we propagate the belief, always works in the same way.

  • 如果某些变量缺少数据,贝叶斯网络可以处理它,因为它们被设计为有效地将信息从具有丰富信息的变量传播到具有较少信息的变量。

  • If some variables have missing data, Bayesian networks can handle it because they are designed to propagate information efficiently from variables with abundant information about them to variables with less information.

链条、叉子和碰撞器

Chains, Forks, and Colliders

贝叶斯网络(具有三个或更多节点)的构建块是三种类型的连接点:链、叉和碰撞器,如图9-13所示。

The building blocks of Bayesian networks (with three or more nodes) are three types of junctions: chain, fork, and collider, illustrated in Figure 9-13.

图 9-13。贝叶斯网络中的三种类型的连接点
Figure 9-13. The three types of junctions in a Bayesian network
链条:A → B → C
Chain: A → B → C

在这个链条中,B是一个中介者。如果我们知道 B 的值,那么了解 A 不会增加或减少我们对 C 的信念。因此,鉴于我们知道中介 B 的值,A 和 C 是有条件独立的。条件独立性允许我们和机器使用贝叶斯网络,仅关注相关信息。

In this chain, B is a mediator. If we know the value of B, then learning about A does not increase or decrease our belief in C. Thus, A and C are conditionally independent, given that we know the value of the mediator B. Conditional independence allows us, and a machine using a Bayesian network, to focus only on the relevant information.

前叉:B → A 和 B → C
Fork: B → A and B → C

也就是说,B 是 A 和 C 的共同父项或混杂因素。数据将显示 A 和 C 在统计上相关,即使它们之间没有因果关系。我们可以通过调节混杂因素 B 来揭露这种虚假的相关性。

That is, B is a common parent or a confounder of A and C. The data will show that A and C are statistically correlated even though there is no causal relationship between them. We can expose this fake correlation by conditioning on the confounder B.
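A quick simulation illustrates the fork. The linear model below is invented for illustration; it shows the spurious A–C correlation appearing in the raw data and vanishing once we condition on the confounder B (here, by restricting attention to a narrow slice of B values).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Fork: B is a common cause of A and C; A and C never influence each other.
b = rng.normal(size=n)
a = b + rng.normal(size=n)
c = b + rng.normal(size=n)

# The raw data show a clear (noncausal) correlation between A and C.
r_raw = np.corrcoef(a, c)[0, 1]

# Condition on the confounder B by looking within a narrow slice of B.
mask = np.abs(b) < 0.05
r_conditioned = np.corrcoef(a[mask], c[mask])[0, 1]

print(round(r_raw, 2))           # ~0.5
print(round(r_conditioned, 2))   # ~0.0: the spurious correlation disappears
```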

对撞机:A → B 和 C → B
Collider: A → B and C → B

当我们以中间的变量为条件时,碰撞器与链或叉子不同。我们在父母指着孩子的例子中看到了这一点。如果 A 和 C 本来是独立的,那么对 B 的调节就会使它们变得依赖!这种意想不到的、非因果的信息传递是贝叶斯网络调节的一个特征:对碰撞器的调节恰好打开了其父母之间的依赖路径。

Colliders are different than chains or forks when we condition on the variable in the middle. We saw this in the example of parents pointing to a child. If A and C are originally independent, conditioning on B makes them dependent! This unexpected and noncausal transfer of information is one characteristic of conditioning in Bayesian networks: conditioning on a collider happens to open a dependence path between its parents.

我们需要注意的另一件事是:当结果和中介变量被混杂时。在这种情况下,以中介变量为条件与将其保持恒定是不同的。

Another thing we need to be careful about: when the outcome and mediator are confounded. Conditioning on the mediator in this case is different than holding it constant.

给定一个数据集,我们如何为涉及的变量建立贝叶斯网络

Given a Data Set, How Do We Set Up a Bayesian Network for the Involved Variables?

贝叶斯网络的图结构可以由我们手动决定,也可以通过算法从数据中学习。贝叶斯网络的算法非常成熟,并且有商业化的实现。一旦网络的结构设置到位,如果关于网络中某个变量的新信息到达,就可以很容易地沿着图传播信息来更新每个节点的条件概率,从而更新关于网络中每个变量的信念。贝叶斯网络的发明者 Judea Pearl 将这种更新过程比作活的有机组织和神经元的生物网络:如果你激发一个神经元,整个网络都会做出反应,将信息从一个神经元传播到它的邻居。

The graph structure of a Bayesian network can be decided on manually by us, or learned from the data by algorithms. Algorithms for Bayesian networks are very mature and there are commercial ones. Once the network’s structure is set in place, if a new piece of information about a certain variable in the network arrives, it is easy to update the conditional probabilities at each node by following the diagram and propagating the information through the network, updating the belief about each variable in the network. The inventor of Bayesian networks, Judea Pearl, likens this updating process to living organic tissue, and to a biological network of neurons, where if you excite one neuron, the whole network reacts, propagating the information from one neuron to its neighbors.

最后,我们可以将神经网络视为贝叶斯网络。

Finally, we can think of neural networks as Bayesian networks.

从数据中学习模式的模型

Models Learning Patterns from the Data

值得注意的是,贝叶斯网络,以及任何其他学习给定数据特征的联合概率分布的模型(例如我们在上一章生成模型中遇到的模型,以及前面章节中的确定性模型),只是从数据中检测模式并学习关联,而不是了解这些模式最初是由什么引起的。对于一个真正像人类一样智能和推理的人工智能体来说,它必须就它所看到的和所做的事情提出“如何”、“为什么”和“如果”的问题,并且必须寻求答案,就像人类在很小的时候所做的那样。事实上,对于人类来说,在成长的早期有一个“为什么”的年龄:在这个年龄,孩子们对每件事、任何事都问“为什么”,让父母抓狂。人工智能代理应该有一个因果模型。这个概念对于实现通用人工智能非常重要。我们将在下一节中讨论它,并在第 11 章概率论中再次讨论它。

It is important to note that Bayesian networks, and any other models learning the joint probability distributions of the features of the given data, such as the models we encountered in the previous chapter on generative models, and the deterministic models in earlier chapters, only detect patterns from the data and learn associations, as opposed to learning what caused these patterns to begin with. For an AI agent to be truly intelligent and reason like humans, it must ask the questions how, why, and what if about what it sees and what it does, and it must seek answers, like humans do at a very early age. In fact, for humans, there is the age of why early in their development: it is the age when children drive their parents crazy asking why about everything and anything. An AI agent should have a causal model. This concept is so important for attaining general AI. We will visit it in the next section and one more time in Chapter 11 on probability.

概率因果建模的图表

Graph Diagrams for Probabilistic Causal Modeling

自从迈出统计的第一步,我们所听到的一切是:相关性不是因果性。然后我们继续讨论数据、更多数据以及数据之间的相关性。好吧,我们明白了,但是因果关系呢?它是什么?我们如何量化它?作为人类,我们确切知道“为什么”意味着什么吗?即使在八个月大的时候,我们也会凭直觉概念化因果关系。事实上,我认为我们在“为什么”的世界中比在“关联”的世界中更多地在自然和直观的层面上运作。那么,为什么我们的机器(我们期望在某个时候能够像我们一样进行推理)只在关联和回归级别上起作用?这就是数学家、哲学家朱迪亚·珀尔 (Judea Pearl) 提出的观点。他在人工智能领域以及他的精彩著作《为什么之书》(Basic Books,2020 年)中提出了这一观点。这本书中我最喜欢的一句话是:“非因果相关性违反了我们的常识。” 我们的想法是,我们需要阐明和量化哪些相关性是由于因果关系,哪些相关性是由于其他因素。

Since taking our first steps in statistics, all we have heard was: correlation is not causation. Then we go on and on about data, more data, and correlations in data. Alright then, we get the message, but what about causation? What is it? How do we quantify it? As humans, do we know exactly what why means? We conceptualize cause and effect intuitively, even at eight months old. I actually think we function more on a natural and intuitive level in the world of why than in the world of association. Why then (see?) do our machines, which we expect that at some point will be able to reason like us, only function at an association and regression level? This is the point that mathematician and philosopher Judea Pearl argues for in the field of AI, and in his wonderful book The Book of Why (Basic Books, 2020). My favorite quote from this book: “Noncausal correlation violates our common sense.” The idea is that we need to both articulate and quantify which correlations are due to causation and which are due to some other factors.

Pearl 使用类似于贝叶斯网络图的图表(图)构建了他的数学因果关系模型,但赋予了其基于 do 演算的概率推理方案,即在给定 do 运算符的情况下计算概率,而不是在给定 observe 运算符的情况下计算概率;后者在非因果统计模型中非常常见。要点是:

Pearl builds his mathematical causality models using diagrams (graphs) that are similar to Bayesian network graphs, but endowed with a probabilistic reasoning scheme based on the do calculus, or computing probabilities given the do operator, as opposed to computing probabilities given the observe operator, which is very familiar from noncausal statistical models. The main point is this:

观察不等于做。用数学符号表示,Prob(公交车乘客人数|颜色编码路线)与 Prob(公交车乘客人数|do(颜色编码路线))不同。

Observing is not the same as doing. In math notation, Prob(number of bus riders|color-coded routes) is not the same as Prob(number of bus riders|do(color-coded routes)).

我们可以从数据中推断出第一个:在某个城市的公交路线采用颜色编码的情况下,查找乘客数量。这个概率并没有告诉我们颜色编码的路线对乘客数量的影响。带有 do 运算符的第二个概率则不同,仅靠数据(没有因果图)无法告诉我们答案。不同之处在于,当我们调用 do 运算符时,我们是刻意将公交车路线更改为颜色编码,并且我们希望评估该更改对公交车乘客量的影响。如果经过这种刻意的“做”之后客流量增加,并且考虑到我们绘制了正确的图(包括变量以及它们如何相互影响),那么我们可以断言使用颜色编码的公交路线导致了客流量的增加。当我们“做”而不是“观察”时,我们手动封锁了所有可能自然导致彩色公交路线的路径,例如领导层的变化或一年中可能影响客流量的时间。如果我们只是观察数据,我们无法封锁这些路径。此外,当我们使用 do 运算符时,我们有意手动将公交路线的值设置为颜色编码(而不是编号等)。

We can infer the first one from the data. Look for the ridership numbers given that the bus routes in a certain city are color-coded. This probability does not tell us the effect color-coded routes have on the number of riders. The second probability, with the do operator, is different, and the data alone, without a causal diagram, cannot tell us the answer. The difference is when we invoke the do operator, then we are deliberately changing the bus routes to color-coded, and we want to assess the effect of that change on bus ridership. If the ridership increases after this deliberate doing, and given that we drew the correct graph, including the variables and how they talk to each other, then we can assert that using color-coded bus routes caused the increase in ridership. When we do instead of observe, we manually block all the roads that could naturally lead to colored bus routes, such as a change in leadership or a time of the year that might affect ridership. We cannot block these roads if we were simply observing the data. Moreover, when we use the do operator, we intentionally and manually set the value of bus routes to color-coded (as opposed to numbered, etc.)
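The observe-versus-do distinction can be made concrete with a toy structural model. Everything below is invented for illustration (a hypothetical "leadership" confounder and made-up numbers); the point is only that conditioning in the data and intervening give different answers.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

def simulate(do_color=None):
    # Hypothetical structural model: good leadership makes color-coded
    # routes more likely AND independently boosts ridership (a confounder).
    leadership = rng.random(n) < 0.5
    if do_color is None:
        color = rng.random(n) < np.where(leadership, 0.9, 0.1)  # observed
    else:
        color = np.full(n, do_color)   # do(): set the value, cut the arrow
    base = np.where(leadership, 120, 80)
    riders = base + 10 * color + rng.normal(0, 5, n)  # true effect: +10
    return color, riders

# Seeing: compare ridership across naturally occurring route labels.
color, riders = simulate()
seen_diff = riders[color].mean() - riders[~color].mean()

# Doing: intervene, holding everything upstream fixed by design.
_, riders_do1 = simulate(do_color=True)
_, riders_do0 = simulate(do_color=False)
done_diff = riders_do1.mean() - riders_do0.mean()

print(round(seen_diff, 1))   # ~42: inflated by the leadership confounder
print(round(done_diff, 1))   # ~10: the actual causal effect
```

The observed difference mixes the causal effect with the confounder’s influence, while the intervention isolates the +10 that changing the routes actually causes.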

事实上,公交车路线的例子并不是编出来的。我目前正在与弗吉尼亚州哈里森堡的公共交通部门合作,目标是在资源有限且大学休学期间城市人口急剧下降的情况下增加乘客量、提高效率并优化运营。2019年,交通部门特意将线路从数字制改为颜色编码制,同时特意将其时刻表从适应大学上课时间表改为固定时刻表。事情是这样的:他们的乘客量猛增了 18%。你敢打赌,我今年夏天(2022 年)从事该项目的学生很快就会画出因果图并写出如下所示的概率:

Actually, the bus routes example is not made up. I am currently collaborating with the department of public transportation in Harrisonburg, Virginia, with the goals of increasing their ridership, improving their efficiency, and optimizing their operations given both limited resources and a drastic drop in the city’s population when the university is not in session. In 2019, the transportation department deliberately changed its routes from a number system to a color-coded system, and at the same time deliberately changed its schedules from adaptive to the university’s class schedule to fixed schedules. Here is what happened: their ridership increased a whopping 18%. You bet my students who are working on the project this summer (2022) will soon be drawing causal diagrams and writing probabilities that look like:

P(ridership | do(color-coded routes, fixed schedules))

一台非常擅长检测模式并对其采取行动的机器——就像一只蜥蜴观察一只飞来飞去的虫子,学习它的模式,然后抓住它并吃掉它——与能够推理的机器有着截然不同的智力水平比单纯检测模式有两个更高的层次:

A machine that is so good at detecting a pattern and acting on it–like a lizard observing a bug fly around, learning its pattern, then catching it and eating it–has a very different level of intellect than a machine that is able to reason on two higher levels than mere detection of patterns:

  1. 如果我故意采取这个行动,[在此插入变量]会发生什么?

  1. If I deliberately take this action, what will happen to [insert variable here]?

  2. 如果我不采取这个行动,[取某个值的变量]还会发生吗?如果哈里森堡没有改用颜色编码的路线和固定的时间表,客流量还会增加吗?如果这些变量中只有一个发生变化而不是同时发生变化怎么办?

  2. If I didn’t take this action, would [the variable taking a certain value] still have happened? If Harrisonburg did not move to color-coded routes and to fixed schedules, would the ridership still have increased? What if only one of these variables changed instead of both?

仅靠数据无法回答这些问题。事实上,精心构建的因果图可以帮助我们区分什么时候可以单独使用数据来回答这些问题,什么时候无论我们收集多少数据都无法回答这些问题。在我们的机器被赋予代表因果推理的图表之前,我们的机器具有与蜥蜴相同的智力水平。令人惊讶的是,人类可以即时完成所有这些计算,尽管多次得出错误的结论,并就因果关系争论了数十年。我们仍然需要数学和图表来解决问题。事实上,该图指导我们并告诉我们必须查找和收集哪些数据、要以哪些变量为条件以及要对哪些变量应用 do运算符。这种有意的设计和推理与积累大量数据或漫无目的地调节各种变量的文化非常不同。

The data alone cannot answer these questions. In fact, carefully constructed causal diagrams help us tell apart the times when we can use the data alone to answer these questions, and when we cannot answer them irrespective of how much more data we collect. Until our machines are endowed with graphs representing causal reasoning, our machines have the same level of intellect as lizards. Amazingly, humans do all these computations instantaneously, albeit arriving at the wrong conclusions many times and arguing with each other about causes and effects for decades. We still need math and graphs to settle matters. In fact, the graph guides us and tells us which data we must look for and collect, which variables to condition on, and which variables to apply the do operator on. This intentional design and reasoning is very different than the culture of amassing big volumes of data, or aimlessly conditioning on all kinds of variables.

现在我们知道了这一点,我们可以绘制图表并设计模型来帮助我们解决各种因果问题。我的痊愈是因为医生的治疗,还是因为时间流逝,生活平静下来?我们仍然需要收集和组织数据,但这个过程现在将是有意的和有指导的。具有这些推理方式的机器——因果图模型,以及与因果图模型相关的(非常短的)有效操作列表——将能够回答所有三个因果关系级别的查询:

Now that we know this, we can draw diagrams and design models that can help us settle all kinds of causal questions. Did I heal because of the doctor’s treatment or did I heal because time has passed and life has calmed down? We would still need to collect and organize data, but this process will now be intentional and guided. A machine endowed with these ways of reasoning—a causal diagram model, along with the (very short) list of valid operations that go with the causal diagram model—will be able to answer queries on all three levels of causation:

  • 变量 A 和 B 相关吗?公交线路的客流量和标签是否相关?

  • Are variables A and B correlated? Are ridership and labeling of bus routes correlated?

  • 如果我将变量 A 设置为特定值,变量 B 将如何变化?如果我故意设置颜色编码的路线,客流量会增加吗?

  • If I set variable A to a specific value, how would variable B change? If I deliberately set color-coded routes, would ridership increase?

  • 如果变量A没有取某个值,变量B会改变吗?如果我不改变颜色编码的公交路线,乘客量还会增加吗?

  • If variable A did not take a certain value, would variable B have changed? If I did not change to color-coded bus routes, would ridership still have increased?

我们仍然需要学习如何处理涉及 do 运算符的概率表达式。我们已经确定,“看”与“做”不同:“看”存在于数据中,而“做”是刻意进行实验来评估某个变量对另一个变量的因果影响,它比仅仅统计数据中看到的比例成本更高。Pearl 建立了三条用于操作涉及 do 运算符的概率表达式的规则。这些规则帮助我们从带有“做”的表达式转换到只带有“看”的表达式,从而可以从数据中得到答案。这些规则很有价值,因为它们使我们能够通过“看”来量化因果效应,绕过“做”。我们将在第 11 章概率论中回顾这些规则。

We still need to learn how to deal with probability expressions that involve the do operator. We have established that seeing is not the same as doing: seeing is in the data, while doing is deliberately running an experiment to assess the causal effect of a certain variable on another. It is more costly than just counting proportions seen in the data. Pearl establishes three rules for manipulating probability expressions that involve the do operator. These help us move from expressions with doing to others with only seeing, where we can get the answers from the data. These rules are valuable because they enable us to quantify causal effects by seeing, bypassing doing. We go over these in Chapter 11 on probability.

图论简史

A Brief History of Graph Theory

如果不对图论和该领域的现状做一个很好的概述,我们就无法离开本章。这个领域建立在如此简单的基础之上,却是美丽的、令人兴奋的,并且有着深远的应用,这让我重新评估了我的整个数学职业道路,并试图紧急转行。

We cannot leave this chapter without a good overview of graph theory and the current state of the field. This area is built on such simple foundations, yet it is beautiful, stimulating, and has far-reaching applications that made me reassess my whole mathematical career path and try to convert urgently.

图论的词汇包括图、节点、边、度、连通性、树、生成树、电路、基本电路、图的向量空间、秩和零(如线性代数)、对偶性、路径、游走、欧拉线、哈密顿电路、剪切、网络流、遍历、着色、枚举、链接和漏洞。

The vocabulary of graph theory includes graphs, nodes, edges, degrees, connectivity, trees, spanning trees, circuits, fundamental circuits, vector space of a graph, rank and nullity (like in linear algebra), duality, path, walk, Euler line, Hamiltonian circuit, cut, network flow, traversing, coloring, enumerating, links, and vulnerability.

图论的发展时间表很有启发性,其根源在于交通系统、地图和地理、电路以及化学中的分子结构:

The timeline of the development of graph theory is enlightening, with its roots in transportation systems, maps and geography, electric circuits, and molecular structures in chemistry:

  • 1736年,欧拉发表了第一篇图论论文,解决了柯尼斯堡桥问题。然后一百多年来,这个领域什么也没发生。

  • In 1736, Euler published the first paper in graph theory, solving the Königsberg bridge problem. Then nothing happened in the field for more than a hundred years.

  • 1847 年,基尔霍夫在研究电网时发展了树木理论。

  • In 1847, Kirchhoff developed the theory of trees while working on electrical networks.

  • 不久之后,即 1850 年代,凯利在尝试列举饱和碳氢化合物 CnH2n+2 的异构体时发现了树。亚瑟·凯利(1821–1895)是图论的创始人之一。在任何有图数据的地方我们都能找到他的名字。最近,CayleyNet 使用称为凯莱多项式的复有理函数,作为对图数据进行深度学习的谱域方法。

  • Shortly after, in the 1850s, Cayley discovered trees as he was trying to enumerate the isomers of saturated hydrocarbons CnH2n+2. Arthur Cayley (1821–1895) is one of the founding fathers of graph theory. We find his name anywhere there is graph data. More recently, CayleyNet uses complex rational functions called Cayley polynomials for a spectral domain approach for deep learning on graph data.

  • 在同一时期,即 1850 年,威廉·汉密尔顿爵士 (Sir William Hamilton)发明了成为汉密尔顿电路基础的游戏,并在都柏林出售。我们有一个木制的正多面体,有 12 个面和 20 个角;每个面都是正五边形,每个角都有 3 条边相交。20个角上有伦敦、罗马、纽约、孟买、德里、巴黎等20个城市的名字。我们必须找到一条沿着多面体边缘的路线,穿过 20 个城市中的每一个城市一次(哈密顿回路)。这个特定问题的解决很容易,但是到目前为止,我们还没有在任意图中存在这样一条路线的充分必要条件。

  • During the same time period, in 1850, Sir William Hamilton invented the game that became the basis of Hamiltonian circuits, and sold it in Dublin. We have a wooden, regular polyhedron with 12 faces and 20 corners; each face is a regular pentagon and 3 edges meet at each corner. The 20 corners have the names of 20 cities, such as London, Rome, New York, Mumbai, Delhi, Paris, and so on. We have to find a route along the edges of the polyhedron, passing through each of the 20 cities exactly once (a Hamiltonian circuit). The solution of this specific problem is easy, but until now we’ve had no necessary and sufficient condition for the existence of such a route in an arbitrary graph.

  • 同样在同一时期,在莫比乌斯的一次演讲(1840 年代)、德摩根的一封信(1850 年代)以及凯利发表在《皇家地理学会学报》第一卷(1879 年)的文章中,图论中最著名的问题,即四色定理(1976 年解决),诞生了。从那时起,它一直吸引着许多数学家,并带来了许多有趣的发现。它指出,四种颜色足以为平面上的任何地图着色,使得具有共同边界的国家具有不同的颜色。有趣的是,如果我们离开平面给自己更多的空间,例如移到环面的表面上,那么相应的着色问题早已有了解答。

  • Also during the same time period, at a lecture by Möbius (1840s), in a letter by De Morgan (1850s), and in a publication by Cayley in the first volume of the Proceedings of the Royal Geographic Society (1879), the most famous problem in graph theory (solved in 1976), the four color theorem, came to life. This has occupied many mathematicians since then, leading to many interesting discoveries. It states that four colors are sufficient for coloring any map on a plane such that the countries with common boundaries have different colors. The interesting thing is that if we give ourselves more room by moving off the flat plane, for example to the surface of a torus, then the analogous coloring problem was settled long ago.

  • 不幸的是,接下来的 70 年左右没有任何进展,直到 1920 年代 König 写了第一本关于这个主题的书并于 1936 年出版。

  • Unfortunately, nothing happened for another 70 years or so, until the 1920s when König wrote the first book on the subject and published it in 1936.

  • 随着计算机的出现以及它们探索组合性质的大问题的能力不断增强,事情发生了变化。这刺激了纯图论和应用图论的激烈活动。目前已有数千篇论文和数十本关于该主题的书籍,重要贡献者包括 Claude Berge、Oystein Ore、Paul Erdös、William Tutte 和弗兰克·哈拉里.

  • Things changed with the arrival of computers and their increasing ability to explore large problems of combinatorial nature. This spurred intense activity in both pure and applied graph theory. There are now thousands of papers and dozens of books on the subject, with significant contributors such as Claude Berge, Oystein Ore, Paul Erdös, William Tutte, and Frank Harary.

图论的主要考虑因素

Main Considerations in Graph Theory

让我们整理一下图论中的主要主题,并以鸟瞰图为目标,而不需要深入细节:

Let’s organize the main topics in graph theory and aim for a bird’s-eye view without diving into details:

生成树和最短生成树

Spanning Trees and Shortest Spanning Trees

这些非常重要,用于网络路由协议、最短路径算法和搜索算法。图的生成树是一个子图,它是一棵树(任何两个顶点都可以通过一条唯一的路径连接),并且包括图的所有顶点。也就是说,生成树将图的顶点保持在一起。同一个图可以有许多生成树。

These are of great importance and are used in network routing protocols, shortest path algorithms, and search algorithms. A spanning tree of a graph is a subgraph that is a tree (any two vertices can be connected using one unique path) including all of the vertices of the graph. That is, spanning trees keep the vertices of a graph together. The same graph can have many spanning trees.
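As a concrete sketch, here is Kruskal’s greedy algorithm for a minimum-weight spanning tree on a small made-up weighted graph; it repeatedly takes the lightest edge that does not close a circuit.

```python
# Minimal Kruskal's algorithm for a minimum-weight spanning tree.
# The graph below is a made-up example: (weight, u, v) edges.
edges = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"),
         (2, "b", "d"), (5, "c", "d")]

parent = {}

def find(x):
    # Union-find with path compression: locate the root of x's component.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

mst, total = [], 0
for w, u, v in sorted(edges):     # greedily take the lightest safe edge
    ru, rv = find(u), find(v)
    if ru != rv:                  # skip edges that would close a circuit
        parent[ru] = rv
        mst.append((u, v, w))
        total += w

print(mst)    # [('a', 'b', 1), ('b', 'd', 2), ('b', 'c', 3)]
print(total)  # 6
```

The tree keeps all four vertices connected with three edges, the fewest possible, at the minimum total weight.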

割集和割顶点

Cut Sets and Cut Vertices

我们可以通过切割足够多的边,或者有时通过删除足够多的顶点,将任何连通图断开。如果我们能够在给定的图中(例如在通信网络、电网、运输网络或其他网络中)找到这些割集,我们就可以切断其断开部分之间的所有通信方式。通常我们感兴趣的是最小割集,它通过删除最少量的边或顶点来完成断开图的任务。这有助于我们识别网络中最薄弱的环节。与生成树相反,割集将顶点分开,而不是将所有顶点保持在一起。因此,我们有理由期望生成树和割集之间存在密切关系。此外,如果该图表示一个带有某种源(例如流体、流量、电力或信息)和一个汇的网络,其中每条边只允许一定量的流通过,那么从源移动到汇的最大流量,与穿过图的边、将源与汇断开且被切割边总容量最小的割之间,存在密切关系。这就是最大流最小割定理,它指出在流网络中,从源到汇的最大流量等于最小割中各边的总权重。在数学中,当一个最大化问题(最大流)以一种非平凡的方式(例如,不只是翻转目标函数的符号)等价于一个最小化问题(最小割)时,它表明存在对偶性。事实上,图的最大流最小割定理是线性优化对偶定理的一个特例。

We can break any connected graph apart, disconnecting it, by cutting through enough edges, or sometimes by removing enough vertices. If we are able to find these cut sets in a given graph, such as in a communication network, an electrical grid, a transportation network, or others, we can cut all communication means between its disconnected parts. Usually we are interested in the smallest or minimal cut sets, which will accomplish the task of disconnecting the graph by removing the least amount of its edges or vertices. This helps us identify the weakest links in a network. In contrast to spanning trees, cut sets separate the vertices, as opposed to keeping all of them together. Thus, we would rightly expect a close relationship between spanning trees and cut sets. Moreover, if the graph represents a network with a source of some sort, such as fluid, traffic, electricity, or information, and a sink, where each edge allows only a certain amount to flow through it, then there is a close relationship between the maximum flow that can move from the source to the sink and the cut through the edges of the graph that disconnects the source from the sink, with minimal total capacity of the cut edges. This is the max-flow min-cut theorem, which states that in a flow network, the maximum amount of flow passing from the source to the sink is equal to the total weight of the edges in a minimal cut. In mathematics, when a maximization problem (max flow) becomes equivalent to a minimization problem (min cut) in a nontrivial way (for example, by not just flipping the sign of the objective function), it signals duality. Indeed, the max-flow min-cut theorem for graphs is a special case of the duality theorem from linear optimization.
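The max-flow min-cut relationship can be checked directly on a toy network. Below is a compact sketch of the Edmonds-Karp augmenting-path algorithm, plus a brute-force minimum cut; on this made-up graph both sides of the duality come out equal.

```python
from collections import deque
from itertools import combinations

# A tiny flow network (a made-up example): capacities on directed edges.
cap = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
       ("a", "t"): 2, ("b", "t"): 3}
nodes = {"s", "a", "b", "t"}

def max_flow(cap, source, sink):
    # Edmonds-Karp: push flow along shortest augmenting paths until none remain.
    flow = {}
    def residual(u, v):
        return cap.get((u, v), 0) - flow.get((u, v), 0)
    total = 0
    while True:
        parent, q = {source: None}, deque([source])
        while q and sink not in parent:          # BFS in the residual graph
            u = q.popleft()
            for v in nodes:
                if v not in parent and residual(u, v) > 0:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return total
        path, v = [], sink                        # recover the augmenting path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(residual(u, v) for u, v in path)  # bottleneck capacity
        for u, v in path:                         # augment, skew-symmetrically
            flow[(u, v)] = flow.get((u, v), 0) + b
            flow[(v, u)] = flow.get((v, u), 0) - b
        total += b

def min_cut(cap, source, sink):
    # Brute force: try every vertex set S with source in S and sink outside,
    # and sum the capacities of edges leaving S.
    others = [v for v in nodes if v not in (source, sink)]
    best = float("inf")
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            S = {source, *extra}
            best = min(best, sum(c for (u, v), c in cap.items()
                                 if u in S and v not in S))
    return best

print(max_flow(cap, "s", "t"), min_cut(cap, "s", "t"))   # 5 5
```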

平面度

Planarity

图的几何表示是平面的还是三维的?也就是说,我们能否在一个平面上绘制图的顶点并连接其边,且边与边互不交叉?这对于复杂系统的自动布线、印刷电路和大规模集成电路等技术应用来说很有意义。对于非平面图,我们感兴趣的是诸如这些图的厚度和边之间的交叉数量等属性。平面图的一个等价条件是对偶图的存在,其中图与其对偶之间的关系在图的向量空间的背景下变得清晰。线性代数和图在这里汇聚,代数和组合表示回答关于几何图形的问题,反之亦然。对于平面性问题,我们只需考虑顶点度数都为三或更多的简单、不可分图。此外,任何边数大于其顶点数三倍减六的图都是非平面的。该研究领域还有许多未解决的问题。

Is the geometric representation of a graph planar or three-dimensional? That is, can we draw the vertices of the graph and connect its edges, all in one plane, without its edges crossing each other? This is interesting for technological applications such as automatic wiring of complex systems, printed circuits, and large-scale integrated circuits. For nonplanar graphs, we are interested in properties such as the thickness of these graphs and the number of crossings between edges. An equivalent condition for a planar graph is the existence of a dual graph, where the relationship between a graph and its dual becomes clear in the context of the vector space of a graph. Linear algebra and graphs come together here, where algebraic and combinatoric representations answer questions about geometric figures and vice versa. For the planarity question, we need to consider only simple, nonseparable graphs whose vertices all have three degrees or more. Moreover, any graph with a number of edges larger than three times the number of its vertices minus six is nonplanar. There are many unsolved problems in this field of study.
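The edge-count condition at the end of the paragraph gives a quick nonplanarity test, sketched below with K5 (5 vertices, 10 edges) as the classic graph it catches. Note the test is only a necessary condition for planarity: a False result is inconclusive.

```python
from itertools import combinations

def fails_planarity_bound(num_vertices, edges):
    # For a simple planar graph with at least 3 vertices, e <= 3v - 6,
    # so e > 3v - 6 proves the graph is nonplanar.
    return len(edges) > 3 * num_vertices - 6

# K5, the complete graph on 5 vertices: 10 edges, while 3*5 - 6 = 9.
k5_edges = list(combinations(range(5), 2))
print(fails_planarity_bound(5, k5_edges))   # True: K5 is nonplanar

# A square with one diagonal: 5 edges <= 3*4 - 6 = 6, so inconclusive.
square = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(fails_planarity_bound(4, square))     # False (and it is in fact planar)
```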

作为向量空间的图

Graphs as Vector Spaces

将图同时理解为几何对象和代数对象,以及理解这两种表示之间的对应关系,是非常重要的。图正是这种情况。每个图都对应于整数模 2 域上的一个 e 维向量空间,其中 e 是图的边数。所以如果图只有三条边 edge_1、edge_2、edge_3,那么它对应于包含向量 (0,0,0)、(1,0,0)、(0,1,0)、(1,1,0)、(1,0,1)、(0,1,1)、(0,0,1)、(1,1,1) 的三维向量空间。这里 (0,0,0) 对应于不包含这三条边中任何一条的空子图,(1,1,1) 对应于包含所有三条边的完整图,(0,1,1) 对应于只包含 edge_2 和 edge_3 的子图,依此类推。

It is important to understand a graph as both a geometric object and an algebraic object, along with the correspondence between the two representations. This is the case for graphs. Every graph corresponds to an e-dimensional vector space over the field of integers modulo 2, where e is the number of edges of the graph. So if the graph only has three edges edge_1, edge_2, edge_3, then it corresponds to the three-dimensional vector space containing the vectors (0,0,0), (1,0,0), (0,1,0), (1,1,0), (1,0,1), (0,1,1), (0,0,1), (1,1,1). Here (0,0,0) corresponds to the null subgraph containing none of the three edges, (1,1,1) corresponds to the full graph containing all three edges, (0,1,1) corresponds to the subgraph containing only edge_2 and edge_3, and so on.

Field of Integers Modulo 2

The field of integers modulo 2 only contains the two elements 0 and 1, with the operations + and × both happening modulo 2. These are in fact equivalent to the logical operations xor (exclusive or operator) and and in Boolean logic. A vector space has to be defined over a field and has to be closed under multiplication of its vectors by scalars from that field. In this case, the scalars are only 0 and 1, and the multiplication happens modulo 2. Graphs are therefore nice examples of vector spaces over finite fields, which are different from the usual real or complex numbers. The dimension of the vector space of a graph is the number of edges e of the graph, and the total number of vectors in this vector space is 2^e. We can see here how graph theory is immediately applicable to switching circuits (with on and off switches), digital systems, and signals, since all operate in the field of integers modulo 2.
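
Since the field has only two elements, the claimed equivalence with Boolean logic can be checked exhaustively:

```python
# Check that + and × modulo 2 coincide with Boolean xor and and.
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b   # addition mod 2 is xor
        assert (a * b) % 2 == a & b   # multiplication mod 2 is and
```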

With this simple correspondence, and backed by the whole field of linear algebra, it is natural to try to understand cut sets, circuits, fundamental circuits, spanning trees, and other important graph substructures, and the relationships among them, in the context of vector subspaces, bases, intersections, orthogonality, and dimensions of these subspaces.

Realizability

We have already used the adjacency matrix and the incidence matrix as matrix representations that completely describe a graph. Other matrices describe important features of the graph, such as the circuit matrix, the cut set matrix, and the path matrix. Then, of course, the relevant studies have to do with how all these relate to each other and interact.

Another very important topic is that of realizability: what conditions must a given matrix satisfy so that it is the circuit matrix of some graph?

Coloring and Matching

In many situations we are interested in assigning labels, or colors, to the nodes or edges of a graph, or even to regions in a planar graph. In the famous graph coloring problem, the colors we assign to the nodes must be such that no neighboring vertices get the same color; moreover, we want to do this using the minimal number of colors. The smallest number of colors required to color a graph is called its chromatic number. Related to coloring are topics such as node partitioning, covering, and the chromatic polynomial.
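
As an illustrative sketch (not an optimal method), a greedy pass over the nodes produces a proper coloring, and hence an upper bound on the chromatic number:

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color unused by its already-colored
    neighbors. This yields a proper coloring and an upper bound on the
    chromatic number, not necessarily the chromatic number itself."""
    colors = {}
    for node in adjacency:
        taken = {colors[nbr] for nbr in adjacency[node] if nbr in colors}
        color = 0
        while color in taken:
            color += 1
        colors[node] = color
    return colors

# A triangle needs 3 colors; its chromatic number is 3.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
coloring = greedy_coloring(triangle)
assert all(coloring[u] != coloring[v] for u in triangle for v in triangle[u])
assert len(set(coloring.values())) == 3
```

On a triangle the greedy bound happens to be tight; in general, greedy can use more colors than the chromatic number.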

A matching is a set of edges no two of which are adjacent. A maximal matching is a matching that cannot be extended by adding another edge. Matchings in general graphs and in bipartite graphs have many applications, such as matching a minimal set of classes to satisfy graduation requirements, or matching job assignments to employee preferences (this ends up being a max-flow min-cut problem). We can use random-walk-based algorithms to find perfect matchings on large bipartite graphs.
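
A maximal matching can be built greedily; here is a small sketch (helper name ours) on a path graph:

```python
def maximal_matching(edges):
    """Greedily build a matching: add an edge whenever neither endpoint is
    already matched. The result is maximal (cannot be extended), though not
    necessarily maximum (largest possible)."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update({u, v})
    return matching

# A path a-b-c-d: edges (a,b) and (c,d) form a matching no edge can extend.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
m = maximal_matching(path_edges)
assert m == [("a", "b"), ("c", "d")]
```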

Enumeration

Cayley in 1857 was interested in counting the number of isomers of the saturated hydrocarbons C_nH_{2n+2}, which led him to count the number of different trees with n nodes, and to his contributions to graph theory. There are many types of graphs to be enumerated, and many have been the topics of their own research papers. Examples include enumerating all rooted trees, simple graphs, simple digraphs, and others possessing specific properties. Enumeration is a huge area in graph theory. One important enumeration technique is Pólya’s counting theorem, where one needs to find an appropriate permutation group and then obtain its cycle index, which is nontrivial.
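
Cayley's count of labeled trees on n nodes is n^(n-2). We can confirm it for tiny n by brute force over all (n-1)-edge subsets of the complete graph (an illustrative check, not an efficient enumeration method):

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Graph search connectivity check on a list of undirected edges."""
    if not nodes:
        return True
    seen, frontier = {nodes[0]}, [nodes[0]]
    while frontier:
        x = frontier.pop()
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                if a == x and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return len(seen) == len(nodes)

def count_labeled_trees(n):
    """Count trees on n labeled nodes by brute force: a tree on n nodes is a
    connected subgraph with exactly n-1 edges."""
    nodes = list(range(n))
    all_edges = list(combinations(nodes, 2))
    return sum(1 for subset in combinations(all_edges, n - 1)
               if is_connected(nodes, subset))

# Cayley's formula: n^(n-2) labeled trees on n nodes.
for n in (3, 4):
    assert count_labeled_trees(n) == n ** (n - 2)
```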

Algorithms and Computational Aspects of Graphs

Algorithms and computer implementations are of tremendous value for anyone working with graph modeling. Algorithms exist for traditional graph theoretical tasks such as:

  • Find out if a graph is separable.

  • Find out if a graph is connected.

  • Find out the components of a graph.

  • Find the spanning trees of a graph.

  • Find a set of fundamental circuits.

  • Find cut sets.

  • Find the shortest path from a given node to another.

  • Test whether the graph is planar.

  • Build a graph with specific properties.
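
As one illustrative sketch of the tasks above (finding the components of a graph, which also answers whether it is connected), a breadth-first search does the job:

```python
from collections import deque

def components(adjacency):
    """Find the connected components of a graph via breadth-first search."""
    seen, comps = set(), []
    for start in adjacency:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        comps.append(comp)
    return comps

graph = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"], "e": []}
comps = components(graph)
assert len(comps) == 3          # three components, so the graph is not connected
assert {"a", "b"} in comps and {"e"} in comps
```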

Graph neural networks nowadays come with their own open source packages. As always, for an algorithm to be of any practical use it must be efficient. Its running time must not increase factorially or even exponentially with the number of nodes of the graph. It should be polynomial time, proportional to n^k, where k is preferably a low number.

For anyone wishing to enter the field of graph modeling, it is of great use to familiarize yourself with both the theory and the computational aspects of graphs.

Summary and Looking Ahead

This chapter was a summary of various aspects of graphical modeling, with emphasis on examples, applications, and building intuition. There are many references for readers aiming to dive deeper. The main message is not to get lost in the weeds (and they are very thick) without an aim or an understanding of the big picture, the current state of the field, and how it relates to AI.

We also introduced random walks on graphs, Bayesian networks, and probabilistic causal models, which shifted our brain even more in the direction of probabilistic thinking, the main topic of Chapter 11. It was my intention all along to go over all kinds of uses for probability in AI before going into a math chapter on probability (Chapter 11).

We leave this chapter with this very nice read: “Relational Inductive Biases, Deep Learning, and Graph Networks” (Battaglia et al. 2018), which makes the case for the deep learning community to adopt graph networks:

We present a new building block for the AI toolkit with a strong relational inductive bias—the graph network—which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.

Chapter 10. Operations Research

Many scientists owe their greatness not to their skill in solving problems but to their wisdom in choosing them.

E. Bright Wilson (1908–1992), American chemist

In this chapter, we explore the integration of AI into the field of operations research, leveraging the best of both worlds for more efficient and more informed decision making. Although this introductory statement sounds like an ad, it is precisely what operations research is all about. Advances in machine learning can only help move the field forward.

Operations research is one of the most attractive and stimulating areas of applied mathematics. It is the science of balancing different needs and available resources in the most time- and cost-efficient ways. Many problems in operations research reduce to searching for an optimal point, the holy grail, at which everything functions smoothly and efficiently: no backups, no interruptions to timely services, no waste, balanced costs, and good revenues for everyone involved. A lot of applications never find the holy grail, but many operations research methods allow us to come very close, at least for the simplified models of the complex reality. Constrained mathematical optimization penetrates every industry, every network, and every aspect of our lives. Done properly, we enjoy its benefits; done improperly, we suffer its impact: global and local economies are still experiencing the ramifications of COVID-19, the war in Ukraine, and the interruptions to the supply chain.

Before exploring how machine learning is starting to make its way into operations research, we highlight a few ideas that an interested person must internalize if they want to get involved in the field. Since we only have one chapter to spend on this beautiful topic, we must distill it into its essence:

The no free lunch theorem

This makes us shift our attention into devising and analyzing methods that work best for the special case scenario at hand, as opposed to looking for the most general and most widely applicable methods, like many mathematicians are naturally inclined to do. It essentially asks all these mathematicians to pretty much chill, and be satisfied with specialized solutions for specific types of problems.

Complexity analysis of problems and asymptotic analysis of algorithms

Asymptotic analysis tells us that even if the algorithm is ultra innovative and genius, it is useless if its computational requirements skyrocket with the size of the problem. Operations research solutions need to scale to big scenarios with many variables. Complexity analysis, on the other hand, addresses the level of difficulty of the problems themselves rather than the algorithms devised to tackle them. Combinatorial problems, which are O(n!), are ultra bad: n! is bigger than k^n for n large enough, but an exponential k^n complexity would already be very bad!
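
We can sanity-check these growth comparisons numerically for a modest n:

```python
from math import factorial

# Factorial growth overtakes exponential growth, which overtakes polynomial
# growth, once n is large enough.
n, k = 20, 2
assert n ** 3 < k ** n < factorial(n)

# n! exceeds k^n even for a larger base k, for n large enough.
assert factorial(25) > 10 ** 25
```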

Important topics and applications in operations research

These we can find in any good book on operations research. We always need to keep one at hand. Moving from a specific application and business objectives to a mathematical formulation is a skill that cannot be stressed enough if we want to thrive in this field.

Various types of optimization methods and algorithms

This is the workhorse of operations research solutions and software packages.

Software packages

The wide availability of these, along with the limited number of pages, is my excuse not to elaborate on anything algorithmic or computational in this chapter.

To sum up operations research in six words: mathematical formulation, optimization, algorithms, software, and decisions.

When reading through this chapter, it is helpful to think of the concepts in the context of how the companies that we interact with in our daily lives manage their operations. Consider, for example, Amazon’s logistics. Amazon is the largest ecommerce company in the world. Its share of the US ecommerce market in 2022 is 45%, selling and delivering millions of units of merchandise every day with around $5,000 in sales every second. How does Amazon succeed at doing this? How does the company manage its inventory, warehouses, transportation, and extremely efficient delivery system? How does Amazon formulate its subproblems and integrate them into one big successful operation? Same with transportation logistics, such as Uber. Every day, Uber provides up to 15 million shared rides worldwide, matching available drivers with nearby riders, routing and timing pickups and drop-offs, pricing trips, predicting driver revenues and supply-and-demand patterns, and performing countless analytics.

The complex and highly interconnected optimization problems that allow such massive systems to run relatively smoothly are typical to operations research. Moreover, a lot of the involved problems are NP-hard (in computational complexity, this means they have a nondeterministic polynomial time level of hardness; in English, very expensive to compute). Add to that their stochastic nature, and we have interesting math problems that need to be solved.

Overall, the mathematical methods and algorithms of operations research save the world billions of dollars annually. A survey of the largest 500 companies in the United States showed that 85% use linear programming (which is another name for linear optimization, a massive part of operations research, and a reason we spend some decent time on the simplex method and duality in this chapter). Coupled with tools from the AI industry, now is the perfect time to get into the field. The rewards will be on many levels: intellectual, financial, and a meaningful contribution to the greater good of humanity. So in no way should the few selected topics for this chapter dim the significance of other equally important topics in the field.

To dive deeper into operations research (after reading this chapter, of course), the best way is to learn from the best:

No Free Lunch

The no free lunch theorem for optimization states that there is no one particular optimization algorithm that works best for every problem. All algorithms that look for an optimizer of an objective function (cost function, loss function, utility function, likelihood function) have similar performance when averaged over all possible objective functions. So if some algorithm performs better than another on some class of objective functions, there are other objective functions where the other algorithm performs better. There is no superior algorithm that works for all kinds of problems. Therefore, picking an algorithm should be problem (or domain) dependent. Depending on our application area, there is plenty of information on which algorithms practitioners use, their justifications for why these are their chosen ones, their comparisons with others on both high-dimensional and reasonable-dimension problems, and their constant attempts for better performance, based mostly on two criteria: speed (computationally not expensive), and accuracy (gives good answers).

Complexity Analysis and O() Notation

Many times, the problem of efficiently allocating limited resources under various constraints boils down to devising efficient algorithms for discrete optimization. Linear programming, integer programming, combinatorial optimization, and optimization on graph structures (networks) are all intertwined (sometimes these are nothing more than two different names for the same thing) and deal with one objective: finding an optimizer from a discrete and finite set of valid options—the feasible set. If the feasible set is not discrete to start with, sometimes we can reduce it to a discrete set if we are to take advantage of the wealth of tools developed for this field. Here is the main issue: exhaustive search is usually not tractable. This means that if we list all the available options in the feasible set and evaluate the objective function at each of them, we would spend an ungodly amount of time to find the point(s) that give the optimal answer. No one said that a finite feasible set cannot be enormous. We need specialized algorithms that efficiently rule out large swaths of the search space. Some algorithms pinpoint the exact solution for some problems, while others can only find approximate solutions, which we have no option but to settle for.
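
A toy illustration of why exhaustive search fails to scale (the instance numbers are ours): brute-forcing even a tiny knapsack problem already loops over all 2^n subsets of items:

```python
from itertools import product

def exhaustive_knapsack(values, weights, capacity):
    """Brute-force search over all 2^n subsets of items. Fine for tiny n,
    intractable for large n: the feasible set is finite but enormous."""
    n = len(values)
    best_value, best_choice = 0, None
    for choice in product([0, 1], repeat=n):   # 2^n candidate subsets
        weight = sum(w * c for w, c in zip(weights, choice))
        value = sum(v * c for v, c in zip(values, choice))
        if weight <= capacity and value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice

# Three items with given values and weights, and a capacity of 5.
best_value, best_choice = exhaustive_knapsack([6, 10, 12], [1, 2, 3], 5)
assert best_value == 22            # take items 2 and 3 (weights 2 + 3 = 5)
assert best_choice == (0, 1, 1)
```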

Let’s now make the following distinctions up front, since this confuses a lot of people:

Complexity analysis is for the problems that we want to solve (routing, traveling salesman, knapsack, etc.)

The intrinsic complexity of a problem is independent of the algorithms used to tackle it. In fact, it sometimes tells us that we cannot hope for a more efficient algorithm for such kinds of problems, or whether we can do better in other cases. In any case, complexity analysis for problems is a rich science on its own, and the field of operations research provides a wealth of complex problems to ponder on. This is where the following terms appear: polynomial problem, nondeterministic polynomial problem, nondeterministic polynomial complete problem, nondeterministic polynomial time hard problem, complement nondeterministic polynomial problem, and complement nondeterministic polynomial complete problem. Those terms are so confusing that someone seriously needs to reconsider their nomenclature. We will not define each here (mainly because the theory is not yet set on the boundaries between these classes of problems), but we will make the following divide: problems that can be solved in polynomial time or less, versus problems for which we cannot find an exact solution in polynomial time, no matter what algorithm is used, in which case we have to settle for approximation algorithms (for example, the traveling salesman problem). Note that sometimes polynomial time problems might not be such a great thing, because, for example, O(n^2000) is not so fast after all.

Asymptotic analysis is for the algorithms that we design to solve these problems

This is where we attempt to estimate the number of operations that the algorithm requires and quantify it relative to the size of the problem. We usually use the big O notation.

Big O() Notation

A function g(n) is O(f(n)) when g(n) ≤ c·f(n) for some constant c, and for all n ≥ n_0.

For example, 2n+1 is O(n), 5n^3 - 7n^2 + 1 is O(n^3), n^2 2^n - 55n^100 is O(n^2 2^n), and 15n log(n) - 5n is O(n log(n)).
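
The first cubic example can be checked numerically against the definition, with c = 5 and n_0 = 1:

```python
# Empirically check the bound g(n) <= c * f(n) for one of the examples:
# g(n) = 5n^3 - 7n^2 + 1 is O(n^3), with c = 5 and n0 = 1,
# since 5n^3 - 7n^2 + 1 <= 5n^3 whenever 7n^2 >= 1.
def g(n):
    return 5 * n ** 3 - 7 * n ** 2 + 1

c, n0 = 5, 1
assert all(g(n) <= c * n ** 3 for n in range(n0, 10_000))
```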

Do not forget the constant asymptotics case O(1), where the operation count of an algorithm is independent of the size of the problem (awesome thing, because this means that it scales without any worries of enormous problems).

For some algorithms, we can count the exact number of operations; for example, to compute the scalar product (dot product) of two vectors of length n, a simple algorithm uses exactly 2n-1 multiplications and additions, which makes it O(n). For multiplying two matrices each of size n × n, a simple algorithm computing the dot product of each row from the first matrix with each column from the second matrix requires exactly (2n-1)n^2 operations, so this will be O(n^3). Matrix inversion is also usually O(n^3).
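
A small sketch that computes a dot product while tallying its operations (the accounting convention, n multiplications plus n-1 additions, follows the text; the helper name is ours):

```python
def dot(u, v):
    """Simple dot product that also returns its operation count:
    n multiplications and n-1 additions, 2n-1 operations in total."""
    products = [a * b for a, b in zip(u, v)]      # n multiplications
    total = products[0]
    adds = 0
    for p in products[1:]:                        # n-1 additions
        total += p
        adds += 1
    return total, len(products) + adds

n = 50
value, ops = dot([1.0] * n, [3.0] * n)
assert value == 150.0
assert ops == 2 * n - 1   # the O(n) count from the text
```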

For anyone interested in asymptotic analysis for algorithms, it quickly becomes obvious that it is slightly more involved than operation counts, because sometimes we have to make estimates or averages of the size of the input (what does n stand for?), decide how to count the operations in an algorithm (by each line of code?), and account for the fact that doing computations on large numbers is more consuming in time and memory than doing operations on smaller numbers. Finally, we prefer algorithms that run in polynomial time or less and not in exponential time or more. Let’s demonstrate with a very simple example.

A person who is used to operating in exact realms and not in approximate or asymptotic realms might be troubled by this discussion, because sometimes, some higher-order algorithms are better for smaller size problems than lower-order ones. For example, suppose the exact operation count of an O(n) algorithm is 2n+99, and that of an O(n^2) algorithm is n^2+1; then it is true that asymptotically (or for large enough n) the O(n) algorithm is better than the O(n^2) one, but that is not the case when n is 10 or smaller, because in this case n^2+1 < 2n+99. This is OK for small enough problems, but never for larger problems.
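
To make the trade-off concrete, here is an illustrative pair of exact counts (numbers chosen by us for this sketch): an O(n) algorithm costing 2n + 99 operations versus an O(n^2) algorithm costing n^2 + 1. The quadratic algorithm performs fewer operations up to a small crossover point and loses beyond it:

```python
def linear_count(n):
    return 2 * n + 99      # illustrative exact count of an O(n) algorithm

def quadratic_count(n):
    return n ** 2 + 1      # illustrative exact count of an O(n^2) algorithm

# The "worse" asymptotic order wins for small problems...
assert all(quadratic_count(n) < linear_count(n) for n in range(1, 11))
# ...but loses for every larger problem size.
assert all(quadratic_count(n) > linear_count(n) for n in range(11, 1000))
```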

Two optimization methods that we will soon mention in this chapter are the simplex method and the interior point method for linear optimization (optimization where both the objective function and the constraints are linear). The interior point method is a polynomial time algorithm and the simplex method is exponential time, so you would expect that everyone would use the cheaper interior point and abandon simplex, but this is not true. The simplex method (and the dual simplex) is still widely used for linear optimization instead of interior point because that exponential time is a worst-case scenario and most applications are not worst-case. Moreover, there are usually trade-offs between algorithms in terms of computational effort per iteration, number of iterations required, the effect of better starting points, whether the algorithm converges or will need extra help near the end, how much computation this extra help would require, and whether the algorithm can take advantage of parallel processing. For this reason, computer packages for linear optimization have efficient implementations of both the simplex and the interior point methods (and many other algorithms as well). Ultimately, we choose what works best for our use cases.

Optimization: The Heart of Operations Research

We found our way back to optimization. In machine learning, optimization is about minimizing the loss function for models that learn deterministic functions, or maximizing the likelihood function for models that learn probability distributions. We do not want a solution that matches the data exactly, since that would not generalize well to unseen data. Hence the regularization methods, early stopping, and others. In machine learning, we use the available data to learn the model: the deterministic function or the probability distribution that is the source of the data (the data-generating rule or process), then we use this learned function or distribution to make inferences. Optimization is just one step along the way: minimize the loss function, with or without regularization terms. The loss functions that appear in machine learning are usually differentiable and nonlinear, and the optimization is unconstrained. We can add constraints to guide the process into some desired realm, depending on the application.

Methods for optimization can either include computing derivatives of the objective function f ( x ) , such as machine learning’s favorite gradient descent (stochastic gradient descent, ADAM, etc.), or not. There are optimization algorithms that are derivative free. These are very useful when the objective function is not differentiable (such as functions with corners) or when the formula of the objective function is not even available. Examples of derivative-free optimization methods include Bayesian search, Cuckoo search, and genetic algorithms.
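
As a minimal derivative-free sketch (pure random search, far cruder than Bayesian search, cuckoo search, or genetic algorithms), we can minimize f(x) = |x|, a function with a corner exactly at its minimizer:

```python
import random

def random_search(f, lo, hi, iters=2000, seed=0):
    """Derivative-free optimization by pure random search: sample points
    uniformly and keep the best. Crude, but it needs no gradient, so it
    handles objectives with corners such as f(x) = |x|."""
    rng = random.Random(seed)
    best_x = rng.uniform(lo, hi)
    best_f = f(best_x)
    for _ in range(iters):
        x = rng.uniform(lo, hi)
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

x, fx = random_search(abs, -10, 10)
assert abs(x) < 0.5    # close to the true minimizer x = 0
```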

Optimization, in particular linear optimization, has been at the heart of operations research since the Second World War, when methods for linear optimization such as the simplex method were developed to aid in military logistics and operations. The goal, as always, is to minimize an objective function (cost, distance, time, etc.) given certain constraints (budget, deadlines, capacity, etc.):

min f(x) subject to the constraints

To learn optimization for operations research, a typical course usually spends a lot of time on linear optimization, integer optimization, and optimization on networks (graphs), since many real-life logistics and resource allocation problems fit perfectly into these formulations. To become thriving operations researchers, we need to learn:

Linear optimization

This is where both the objective function and the constraints are linear. Here we learn about the simplex method, duality, Lagrangian relaxation, and sensitivity analysis. In linear problems, the boundaries of our world are flat, made of lines, planes, and hyperplanes. This (hyper)polygonal geometry, or polyhedron, usually has corner points that are candidates for being optimizers, so we devise systematic ways to sift through these points and test them for optimality (this is what the simplex method and the dual simplex method do).
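
For a toy two-variable problem, we can imitate the spirit of corner-point methods by enumerating all corners of the feasible polyhedron directly (a brute-force sketch for illustration only, not the simplex method itself; the problem data are ours):

```python
from itertools import combinations

# A tiny linear program: maximize x + 2y subject to
#   x >= 0, y >= 0, x + y <= 4, y <= 3.
# Each constraint is written as a*x + b*y <= c:
constraints = [(-1, 0, 0), (0, -1, 0), (1, 1, 4), (0, 1, 3)]

def intersect(c1, c2):
    """Intersection point of the two boundary lines a*x + b*y = c."""
    (a1, b1, d1), (a2, b2, d2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None   # parallel boundaries
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

def feasible(p, tol=1e-9):
    return all(a * p[0] + b * p[1] <= c + tol for a, b, c in constraints)

# The optimum of a linear program sits at a corner of the feasible
# polyhedron, so for a toy problem we can simply check every corner.
corners = [p for c1, c2 in combinations(constraints, 2)
           if (p := intersect(c1, c2)) is not None and feasible(p)]
best = max(corners, key=lambda p: p[0] + 2 * p[1])
assert best == (1.0, 3.0)               # optimal corner
assert best[0] + 2 * best[1] == 7.0     # optimal value
```

The simplex method avoids this full enumeration by walking from corner to adjacent corner, improving the objective at each step.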

Interior point methods

For large-scale linear optimization problems that could be beyond the reach of the simplex method. In short, the simplex method goes around the boundary of the feasible search space (the edges of the polyhedron), checks each corner it arrives at for optimality, then moves to another corner at the boundary. The interior point method, on the other hand, goes through the interior of the feasible search space, arriving at an optimal corner from the inside of the feasible search space, as opposed to from the boundary.

Integer programming

Optimization where the entries of the optimizing vector must all be integers. Sometimes they can only be zero or one (send the truck to the warehouse in Ohio or not). The knapsack problem is a very simple prototype example. Here we learn about the branch and bound method for large integer programming problems.
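
A bare-bones sketch of branch and bound on a tiny knapsack instance (our own toy data; the bound used here, the sum of all remaining values, is deliberately crude):

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Minimal branch and bound for the 0/1 knapsack problem. Each node
    branches on taking or skipping the next item; a branch is pruned when
    even taking every remaining item could not beat the best value found."""
    n = len(values)
    best = [0]

    def search(i, value, room):
        if value > best[0]:
            best[0] = value
        if i == n:
            return
        # Bound: optimistic value if we could take every remaining item.
        if value + sum(values[i:]) <= best[0]:
            return  # prune this branch
        if weights[i] <= room:                      # branch: take item i
            search(i + 1, value + values[i], room - weights[i])
        search(i + 1, value, room)                  # branch: skip item i

    search(0, 0, capacity)
    return best[0]

assert knapsack_branch_and_bound([6, 10, 12], [1, 2, 3], 5) == 22
```

Real solvers use much tighter bounds (for example, from the linear programming relaxation), which prune far more of the tree.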

Optimization on networks

We can reformulate many network problems as linear optimization problems where the simplex methods and specialized versions of it work, but it is much better to exploit the network structure and tap into useful results from graph theory, such as the max-flow min-cut theorem, for more efficient algorithms. Many problems on networks boil down to optimizing for one of the following: shortest path on the network (path from one node to another with minimum distance or minimum cost), minimum spanning tree of a network (this is great for optimizing the design of networks), maximum flow (from origin to destination or from source to sink), minimum cost flow, multicommodity flow, or traveling salesman (finding the minimum cost [or distance or weight] cyclic route that passes through all the network’s nodes only once [Hamiltonian circuit]).
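
As one sketch of the first task in that list, shortest paths on a weighted network, here is the classic priority-queue version of Dijkstra's algorithm (which requires nonnegative edge weights):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source on a graph with nonnegative
    edge weights, using a priority queue of tentative distances."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# A small weighted network: the direct edge a->c costs 10, but a->b->c costs 3.
graph = {"a": [("b", 1), ("c", 10)], "b": [("c", 2)], "c": []}
dist = dijkstra(graph, "a")
assert dist == {"a": 0, "b": 1, "c": 3}
```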

Nonlinear optimization

The objective function and/or the constraints are nonlinear. One recurring example throughout this book is minimizing nonlinear loss functions for machine learning models. These are always nonlinear, and we commonly use gradient descent-type algorithms. For smaller problems we can use Newton-type algorithms (second derivatives). In operations research, nonlinearities in the objective function and/or constraints might appear because the cost of shipping goods from one location to another might not be fixed (for example, it depends on the distance or on the quantity), or a flow through a network might include losses or gains. A special type of nonlinear optimization that we know a lot about is quadratic optimization with linear constraints. This appears in applications such as network equations for electric circuits, and elasticity theory for structures, where we consider displacements, stresses, strains, and balance of forces in a structure. Think of how easy it is to find the minimum of the quadratic function $f(x) = s x^2$, where s is a positive constant. This ease translates nicely to higher dimensions, where our objective function looks like $f(x) = x^T S x$, where S is a positive semidefinite matrix, playing the same role for high dimensions as a positive constant for one dimension. Here we even have duality theory that we can take advantage of, similar to the linear optimization case. In optimization, when we lose linearity, we hope our functions are quadratic and our constraints are linear. When we lose that, we hope our functions and/or feasible set are convex. When we lose convexity, we are on our own, hoping our methods don't get stuck at the local minima of high-dimensional landscapes, and somehow find their way to optimal solutions.
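
To make the quadratic case concrete, here is a minimal gradient descent sketch for $f(x) = x^T S x$ with a small hand-picked positive definite S; for symmetric S the gradient is 2Sx, and the iterates contract toward the unique minimizer at the origin:

```python
def grad_descent_quadratic(S, x0, lr=0.1, steps=200):
    # Minimize f(x) = x^T S x for a 2x2 symmetric positive definite S.
    # The gradient of x^T S x (symmetric S) is 2 S x.
    x = list(x0)
    for _ in range(steps):
        g = [2 * (S[0][0] * x[0] + S[0][1] * x[1]),
             2 * (S[1][0] * x[0] + S[1][1] * x[1])]
        x = [x[0] - lr * g[0], x[1] - lr * g[1]]
    return x

# Hand-picked example: S = diag(2, 1); iterates shrink toward [0, 0].
x_min = grad_descent_quadratic([[2, 0], [0, 1]], [1.0, 1.0])
```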

Dynamic programming and Markov decision processes

Dynamic programming has to do with projects with multiple stages, where decisions have to be made at each stage, and each decision generates some immediate cost. The decision at each stage has to do with the current state, together with a policy to transition to the next state (choose the next state via a minimization of a deterministic function or a probability). Dynamic programming is all about devising efficient ways, usually recursive methods, to find the optimal sequence of interrelated decisions to fulfill a certain goal. The idea is to avoid having to list all the options for each stage of the decision process, then selecting the best combination of decisions. Such an exhaustive search is extremely expensive for problems with many decision stages, each having many states. Now if the transition policy from one stage to the other is probabilistic rather than deterministic, and if the stages of the decision process continue to recur indefinitely, meaning if the project has an infinite number of stages, then we have a Markov decision process (or Markov chain) on our hands. This is a process that evolves over time in a probabilistic manner. A very special property of a Markov decision process is that the probabilities involving how the process evolves in the future are independent of past events, and depend only on the system's current state. Both discrete time and continuous time Markov chains model important systems, such as queuing systems, dynamic traffic light control to minimize car waiting time, and flexible call center staffing. The important math objects are the transition matrices, and we solve for the steady state probabilities, which ends up requiring the computation of the eigenspace of the transition matrix.
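
The steady state computation can be sketched with power iteration, which converges to the eigenvector of the (transposed) transition matrix for eigenvalue 1; the two-state transition matrix below is a made-up example:

```python
def steady_state(P, iters=1000):
    # Power iteration for the stationary distribution pi = pi P of a
    # row-stochastic transition matrix P (i.e., the eigenvector of P^T
    # for eigenvalue 1, normalized to sum to one).
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Made-up two-state chain; its stationary distribution is (5/6, 1/6).
pi = steady_state([[0.9, 0.1], [0.5, 0.5]])
```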

Stochastic algorithms

Dynamic programming with probabilistic transition policy and Markov chain are both examples of stochastic algorithms. So are stochastic gradient descent and random walks on graphs. Any algorithm that involves an element of randomness is stochastic. The mathematics transitions to the language of probabilities, expectations, stationary states, convergence, etc. Another example where stochastic algorithms and analysis of processes appear is queuing theory, such as queues at a hospital emergency room or at a ship maintenance yard. This builds on probability distributions of arrival times of customers and service times by the service facility.
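
A queuing example can be simulated directly. The sketch below is a Monte Carlo estimate of the mean waiting time in an M/M/1 queue (exponential interarrival and service times, one server, first-come first-served); the rates are arbitrary illustration values, and the estimate should land near the known formula λ/(μ(μ−λ)):

```python
import random

def mm1_mean_wait(arrival_rate, service_rate, n_customers=200_000, seed=0):
    # Simulate n_customers through an M/M/1 queue and return the average
    # time spent waiting before service begins.
    rng = random.Random(seed)
    t_arrival = 0.0
    server_free_at = 0.0
    total_wait = 0.0
    for _ in range(n_customers):
        t_arrival += rng.expovariate(arrival_rate)   # next arrival
        start = max(t_arrival, server_free_at)       # service start time
        total_wait += start - t_arrival              # time spent in queue
        server_free_at = start + rng.expovariate(service_rate)
    return total_wait / n_customers
```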

Metaheuristics

For many optimization problems, finding the optimal solution might be impractical, so we (who still need to make decisions) resort to heuristic methods to find an answer (I will not call it a solution), which is not necessarily optimal but is good enough for the problem at hand. Metaheuristics are general solution methods that provide strategy guidelines and general frameworks for developing heuristic methods to fit certain families of problems. We cannot guarantee the optimality of an answer from a heuristic method, but heuristics do speed up the process of finding satisfactory solutions where optimal solutions are too expensive to compute or are out of reach. There is also the topic of satisfiability. Since problems in operations research are almost always constrained, the natural question is: are the constraints satisfiable? Meaning, is the feasible set nonempty? Some operations research problems get reformulated as satisfiability problems.
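
Simulated annealing is one classic metaheuristic of this kind: it accepts occasional uphill moves with probability exp(−Δ/T) so the search can escape local minima. This is a bare-bones sketch, with the cooling schedule, step size, and test function all being arbitrary choices of mine, and with no optimality guarantee, as discussed above:

```python
import math
import random

def simulated_annealing(f, x0, steps=20_000, temp0=1.0, seed=0):
    # Metaheuristic sketch: propose a random nearby point; always accept
    # improvements, and accept worse points with probability exp(-delta/T),
    # where the temperature T cools as the iteration count grows.
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        T = temp0 / k                      # simple cooling schedule
        cand = x + rng.uniform(-0.5, 0.5)  # random local move
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / T):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest
```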

In real-world problems, a big part of the work of operations research departments is formulating their specific use cases and objectives in a way that can fit into one of these optimization frameworks. Here it is important to recognize special structures (such as sparsity in the involved matrices) or substructures that we can exploit for more efficient algorithms. This is crucial for complicated and large-scale systems.

Thinking About Optimization

When we encounter an optimization problem in mathematics:

$$\min_{x \,\in\, \text{some feasible set}} f(x)$$

where the feasible set is defined by some constraints that the vector x must satisfy (or it could be totally unconstrained), we usually pause and brainstorm a little:

  • Is f ( x ) linear?

  • Is f ( x ) convex? Bounded below?

  • Is the minimum value finite, or does it go to $-\infty$?

  • Is the feasible set nonempty? Meaning are there x ’s that actually satisfy the constraints?

  • Is the feasible set convex?

  • Does a minimizer exist?

  • Is a minimizer unique, or are there others?

  • How do we find the minimizer?

  • What is the value of the minimum?

  • How much do the minimizer and the value of the minimum change if something changes in our constraints or in our objective function?

Depending on the type of problem at hand, we might be able to answer these questions independently, meaning sometimes we can answer only some of them and not others. This is fine because any information about the optimizer and the value of the optimum is valuable.

Let’s explore common types of optimization problems.

Optimization: Finite Dimensions, Unconstrained

This is similar to the optimization that we do in calculus classes, and the optimization we do when training a machine learning model, minimizing the loss function. The objective function f ( x ) is differentiable:

$$\min_{x \,\in\, \mathbb{R}^d} f(x)$$

In unconstrained and differentiable optimization, the minimizer $x^*$ satisfies $\nabla f(x^*) = 0$. Moreover, the Hessian (matrix of second derivatives) is positive semidefinite at $x^*$. When discussing optimization for machine learning, we settled on stochastic gradient descent and its variants for very high-dimensional problems. For smaller problems, Newton-type (working with second derivatives, not only first ones) methods work as well. For very few problems, such as the mean squared error loss function for linear regression, we can get analytical solutions. Examples where we can get analytical solutions are usually carefully constructed (such as all the examples in our calculus books), and very low dimensional.
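
A minimal Newton iteration for a one-dimensional problem illustrates the second-derivative idea; the function $f(x) = x^2 - \ln x$ is my own hand-picked example, whose minimizer solves $2x = 1/x$:

```python
def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    # Newton's method for 1D unconstrained minimization: iterate
    # x <- x - f'(x) / f''(x) until the step is (numerically) zero.
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Minimize f(x) = x^2 - ln(x) on x > 0: f'(x) = 2x - 1/x, f''(x) = 2 + 1/x^2.
x_star = newton_minimize(lambda x: 2 * x - 1 / x,
                         lambda x: 2 + 1 / x ** 2,
                         x0=1.0)
```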

Optimization: Finite Dimensions, Constrained Lagrange Multipliers

Let’s think of the case where we only have one constraint g ( x ) = b . This explains what we need rather well. The minimization problem looks like:

$$\min_{x \,\in\, \mathbb{R}^d,\; g(x) = b} f(x)$$

If $f(x)$ and $g(x)$ are differentiable functions from $\mathbb{R}^d$ to $\mathbb{R}$, we can introduce Lagrange multipliers (a method from 1797) to change our problem into an unconstrained one, but in higher dimensions (corresponding to the new Lagrange multipliers that we introduce to the optimization problem). Nothing is free. In this case, we add a multiple of our constraint to the objective function, then minimize, which means look for the points where the gradient is zero. The new objective function for the unconstrained problem is called the Lagrangian, and it is a function of both the decision vector $x$ and the new variable $\lambda$, which we multiplied by our constraint, called the Lagrange multiplier:

$$L(x; \lambda) = f(x) + \lambda \left( b - g(x) \right)$$

If we have more than one constraint, say five constraints, then we introduce a Lagrange multiplier for each, adding five extra dimensions to our optimization problem to move it from the constrained regime to the unconstrained one.

The optimizer $(x^*, \lambda^*)$ of the unconstrained problem must satisfy $\nabla L(x^*; \lambda^*) = 0$. We go about finding it the same way we go about general unconstrained problems (see the previous case). The $x^*$ from $(x^*, \lambda^*)$ is the solution of the constrained problem that we were originally searching for. This means that it is the point on the hypersurface defined by the constraint $g(x^*) = b$ where the value of f is smallest.
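
For a tiny worked instance (my own choice of f and g, not one from the book): minimizing $f(x,y) = x^2 + y^2$ subject to $x + y = b$, setting $\nabla L = 0$ gives $2x = \lambda$, $2y = \lambda$, $x + y = b$, which we can solve in closed form:

```python
def lagrange_min_on_line(b):
    # Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y = b.
    # From the gradient of L(x, y; lambda) = f + lambda * (b - g):
    #   2x = lambda, 2y = lambda, x + y = b  =>  x = y = b/2, lambda = b.
    lam = float(b)
    x = y = b / 2.0
    return (x, y), lam
```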

If the problem has a special structure that we can exploit, such as if f is quadratic and the constraint g is linear, or if both f and g are linear, then we have more convenient methods to go about this constrained optimization, both if we decide to use Lagrange multipliers (which introduce duality) and without using Lagrange multipliers. Luckily, optimization problems with simple structures are very well studied, not only because they make the mathematics and computations easier, but also because they appear all the time in science and in real-life applications, which gives some credibility to my theory that nature is simpler than mathematicians think it is. We will revisit Lagrange multipliers for constrained problems in the section on duality, where we focus solely on fully linear problems or quadratic problems with linear constraints.

The meaning of Lagrange multipliers

The nice thing that we should make a permanent mental note of is that the Lagrange multiplier $\lambda$ is not some worthless auxiliary scalar that helps us change a constrained problem into an unconstrained one. It has a meaning that is very helpful for sensitivity analysis, for finance and operations research applications, and for duality theory (which are all related to each other). Mathematically, by observing the formula of the Lagrangian $L(x; \lambda) = f(x) + \lambda (b - g(x))$, $\lambda$ is the rate of change of the Lagrangian as a function of b, if we were allowed to vary b (the value of the constraint; in applications we care about the effect of pushing or relaxing the constraints). That is:

$$\frac{\partial L(x; \lambda, b)}{\partial b} = \frac{\partial \left( f(x) + \lambda (b - g(x)) \right)}{\partial b} = \frac{\partial f(x)}{\partial b} + \frac{\partial \left( \lambda (b - g(x)) \right)}{\partial b} = 0 + \lambda = \lambda$$

Moreover, we can interpret the optimal value $\lambda^*$ corresponding to the optimizer $x^*$ as the marginal effect of b on the optimal attainable value of the objective function $f(x^*)$. Hence, if $\lambda^* = 2.1$, then increasing b by one unit (pushing the constraint by one unit) will increase the optimal value of f by 2.1 units. This is very valuable information for applications in finance and operations research. Let's see why this is the case. We want to prove that:

$$\frac{d f(x^*(b))}{d b} = \lambda^*$$

Note that two things happen at the optimizer $x^*(b)$, which we get when we set the gradient of the Lagrangian to zero: $\nabla f(x^*(b)) = \lambda^* \nabla g(x^*(b))$, and $g(x^*(b)) = b$. Using this information and the chain rule for derivatives (go back to your calculus book and master the chain rule, we use it all the time), we now have:

$$\frac{d f(x^*(b))}{d b} = \nabla f(x^*(b)) \cdot \frac{d x^*(b)}{d b} = \lambda^* \, \nabla g(x^*(b)) \cdot \frac{d x^*(b)}{d b} = \lambda^* \, \frac{d g(x^*(b))}{d b} = \lambda^* \, \frac{d b}{d b} = \lambda^* \times 1 = \lambda^*$$

In other words, the Lagrange multiplier $\lambda^*$ is the rate of change of the optimal cost (value of the objective function) due to the relaxation of the corresponding constraint. In economics, $\lambda^*$ is called the marginal cost with respect to the constraint, or the shadow price. When we discuss duality later in this chapter, we use the letter p for the decision variables of the dual problem for this price reason.
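
We can check the shadow price interpretation numerically on a toy problem of my own choosing (min $x^2 + y^2$ subject to $x + y = b$, whose optimal value is $b^2/2$ with $\lambda^* = b$): a finite-difference derivative of the optimal value with respect to b should match $\lambda^*$:

```python
def optimal_value(b):
    # Closed-form optimal value of: min x^2 + y^2 subject to x + y = b.
    # The minimizer is x = y = b/2, so the optimal value is b^2 / 2.
    return b ** 2 / 2.0

b, h = 4.0, 1e-6
lambda_star = b  # Lagrange multiplier at the optimum for this toy problem
marginal = (optimal_value(b + h) - optimal_value(b)) / h
# marginal approximates lambda_star: the shadow price of the constraint.
```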

Optimization: Infinite Dimensions, Calculus of Variations

The field of calculus of variations is an optimization field, but instead of searching for optimizing points in finite dimensional spaces, we are searching for optimizing functions in infinite dimensional spaces.

A finite dimensional function from $\mathbb{R}^d$ to $\mathbb{R}$ looks like $f(x) = f(x_1, x_2, \ldots, x_d)$, and its gradient (which is always important for optimization) is:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right)^T$$

Its directional derivative measures the change of f, or its variation, in the direction of some vector n :

$$f'(x; n) = \lim_{h \to 0} \frac{f(x + h n) - f(x)}{h} = \nabla f \cdot n$$

Now if we allow x to depend on time, then we have:

$$f'(x(t)) = \frac{d f(x(t))}{d t} = \nabla f \cdot \frac{d x(t)}{d t} = \nabla f \cdot x'(t)$$

That expression comes in handy when calculating variations of infinite dimensional functionals. A functional is a function whose input is a function and whose output is a real number. Thus, an infinite dimensional functional $E(u): \text{some function space} \to \mathbb{R}$ maps a function u that lives in some function space to a real number.

One example of a functional is the integral of a continuous function on the interval [0,1]. Another popular example is this integral:

$$E(u) = \int_0^1 \left( u(x)^2 + u'(x)^2 \right) dx$$

For instance, this functional maps the function $x^2$ to the number 23/15, which is the value of the integral.
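
We can sanity-check such a functional numerically; the sketch below approximates the integral of $u(x)^2 + u'(x)^2$ over [0, 1] with a simple midpoint rule and evaluates it at $u(x) = x^2$:

```python
def functional_E(u, du, n=100_000):
    # Midpoint-rule approximation of E(u) = integral over [0, 1] of
    # u(x)^2 + u'(x)^2, given u and its derivative du.
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += (u(x) ** 2 + du(x) ** 2) * h
    return total

# Evaluate the functional at u(x) = x^2, whose derivative is 2x.
value = functional_E(lambda x: x ** 2, lambda x: 2 * x)
```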

The analog for that finite dimensional time derivative expression (when we allow dependence on time), is now:

$$\frac{d E(u(t))}{d t} = \langle \nabla E, u'(t) \rangle_{\text{some function space}}$$

This is very useful. It usually helps us pinpoint the infinite dimensional gradient $\nabla E$, if in our calculations we manage to isolate the quantity multiplied by $u'(t)$. These calculations usually involve integral expressions, and the product is usually defined in some infinite dimensional sense, which also involves integral expressions. This is why we use the product notation $\langle \nabla E, u'(t) \rangle$ instead of the usual dot.

We will work through one example to demonstrate a product for functions that live in the infinite dimensional space $L^2(D)$, which contains all the functions u(x) with a finite $\int_D |u(x)|^2 \, dx$, and the gradient of a functional in the $L^2(D)$ sense.

Analogy between optimizing functions and optimizing functionals

It is good to keep in mind the analogy between finite and infinite dimensional formulations, as everything in math usually ties neatly together. At the same time we must be cautious, as the transition to the infinite is massive, and many finite dimensional properties and methods do not make it through.

In finite dimensions, the optimizing point or points satisfy an equation based on setting the gradient of the objective function (alternatively in this book: loss function, cost function, or utility function) equal to zero.

In infinite dimensions, the optimizing function or functions satisfy a differential equation based on setting the gradient of the objective functional equal to zero, that is, given that we somehow manage to define the gradient of a functional. To find the optimizer, we either have to solve this differential equation, called the Euler-Lagrange equation, or follow some optimization scheme on the landscape of the functional. It is impossible to visualize the landscape of an infinite dimensional functional, so ironically, we end up visualizing it as only one-dimensional, where the x-axis represents the function space u, and the y-axis represents E(u).

The gradient descent that we use extensively in finite dimensional machine learning is an example of an optimization scheme. The same idea of the gradient descent applies to infinite dimensions: follow the direction of the steepest increase (if maximizing) or decrease (if minimizing).

Of course, we need to define what the gradient means for functionals defined on infinite dimensional spaces. It turns out there are many ways we can define gradients, depending on what spaces the involved functions live in (such as space of all continuous functions, space of all functions that have one continuous derivative, space of functions whose integral of their square is finite, and many others). The meaning of the gradient remains the same: it measures the variation of the functional in a certain sense, just like the gradient of a finite dimensional function measures the variation (change) of the function in a certain direction.

You can skip the rest of this section if you are not interested in differential equations, as this section is not essential to operations research. The only purpose of it is to explore transitioning into infinite dimensions, and to see how we can obtain differential equations when we minimize formulas containing integrals (the functional formulas). For more details on the following examples, check the PDF file on calculus of variations on this book’s GitHub page.

Example 1: Harmonic functions, the Dirichlet energy, and the heat equation

A harmonic function is a function whose sum of all of its second derivatives is zero; for example, in two dimensions: $\Delta u = u_{xx} + u_{yy} = 0$. Example functions include $e^x \sin(y)$ and $x^2 - y^2$. In real life, these types of functions appear in electrostatics when modeling electrostatic potentials and charge density distributions, or when modeling rhythmic motions, such as a string undergoing a rhythmic periodic motion, or a frictionless pendulum oscillating indefinitely.

A harmonic function minimizes the Dirichlet energy

Instead of trying to look for functions whose second derivatives add up to zero (there are very well-established ways to do this for Δ u = 0 ) and satisfying certain boundary conditions, there is another nice way to think about a harmonic function as the minimizer of an energy functional, called the Dirichlet energy functional:

$$E(u) = \int_D \frac{1}{2} \, |\nabla u(x)|^2 \, dx$$

Here, u(x) belongs to an appropriate function space that guarantees that the integral is finite, and u(x) = h(x) is specified on the boundary $\partial D$ of the domain D.

When we set the gradient of this energy functional equal to zero, to find its minimizer, we get the (Euler-Lagrange) equation $\Delta u = 0$, u = h(x) on $\partial D$, which is exactly the differential equation a harmonic function satisfies. Let's see how this happens:

$$\frac{d E(u(t))}{d t} = \int_D \left( \frac{1}{2} |\nabla u(x)|^2 \right)' dx = \int_D \nabla u(x) \cdot \nabla u'(x) \, dx = -\int_D \Delta u(x) \, u'(x) \, dx + \int_{\partial D} \frac{\partial u}{\partial n} \, u'(x) \, ds$$

We obtained the third equality using integration by parts, which moves the derivative from one factor in the integral to the other, picking up a negative sign and a boundary term in the process. The boundary term contains the two factors in the original integral, but without the derivative that was moved from one to the other.

The previous expression is true for any $u'(x)$ in our function space; in particular, it will be true for those with $u'(x) = 0$ on the boundary of the domain, which kills the integral term at the boundary and leaves us with:

$$\frac{d E(u(t))}{d t} = -\int_D \Delta u(x) \, u'(x) \, dx = \langle -\Delta u(x), u'(x) \rangle_{L^2(D)}$$

We just defined the product in the L 2 ( D ) function space as the integral of the usual product of the functions over the domain.

Now note that, analogous to the finite dimensional case, we have:

$$\frac{d E(u(t))}{d t} = \langle \nabla_{L^2(D)} E, u'(x) \rangle_{L^2(D)}$$

Comparing the last two expressions, we notice that the gradient of the Dirichlet energy E(u) in the $L^2(D)$ sense is $-\Delta u(x)$, and it is zero exactly for harmonic functions.

The heat equation does gradient descent for the Dirichlet energy functional

There is a bit more to this story. In nature, when a system happens to be evolving in time, the first question usually is: what’s driving the evolution? A neat and intuitive way to answer this is that the system evolves in a way that decreases some energy in the most efficient way. It is usually hard to discover the formulas for these energy functionals, but if we happen to discover one, we might get a Ph.D. degree because we did, like what happened with me.

A simple example is the heat equation $u_t = \Delta u$, with u(x,t) = 0 on $\partial D$ and some initial condition u(x,0) = g(x). This models the diffusion of heat, smoke, atoms on material surfaces, etc. It has time dependence naturally built into it. If we follow the evolution in time of the solution u(x,t) of the heat equation (which might represent temperature, solute concentration, gas in a room, etc.), then we notice that we are sliding on the landscape of the Dirichlet energy functional $E(u) = \frac{1}{2} \int_D |\nabla u(x,t)|^2 \, dx$, in the steepest descent direction, in the $L^2(D)$ sense, because, as we discussed:

$$u_t = \Delta u = -\nabla_{L^2(D)} E(u)$$

This means that starting initially at some u ( x , 0 ) = g ( x ) , the fastest way to arrive to the minimizer of the Dirichlet energy, which is the harmonic function, on the infinite dimensional landscape of the Dirichlet energy, is through solving the heat equation: follow the path of the initial g(x) as it evolves with time.
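
This gradient flow picture is easy to observe numerically. In the sketch below, an explicit finite-difference scheme for the 1D heat equation with zero boundary values (my discretization parameters are arbitrary, chosen within the explicit stability limit dt ≤ dx²/2) produces a discrete Dirichlet energy that decreases monotonically in time:

```python
def heat_step(u, dt, dx):
    # One explicit finite-difference step of u_t = u_xx, with u = 0
    # held at both endpoints (Dirichlet boundary conditions).
    n = len(u)
    new = u[:]
    for i in range(1, n - 1):
        new[i] = u[i] + dt * (u[i - 1] - 2 * u[i] + u[i + 1]) / dx ** 2
    return new

def dirichlet_energy(u, dx):
    # E(u) = (1/2) * integral of |u'|^2, via forward differences.
    return 0.5 * sum(((u[i + 1] - u[i]) / dx) ** 2 * dx
                     for i in range(len(u) - 1))

# Initial bump g(x) = x(1 - x); evolve and record the energy at each step.
n, dx = 51, 1.0 / 50
dt = 0.4 * dx ** 2          # inside the explicit stability limit dx^2 / 2
u = [x * (1 - x) for x in (i * dx for i in range(n))]
energies = []
for _ in range(100):
    energies.append(dirichlet_energy(u, dx))
    u = heat_step(u, dt, dx)
```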

In this sense, the heat equation gives a viable route to formulate a minimizing scheme for the problem:

$$\min_{u = 0 \text{ on } \partial D} \; \frac{1}{2} \int_D |\nabla u(x,t)|^2 \, dx$$

Example 2: The shortest path between two points is along the straight line connecting them

The shortest path between two points in $\mathbb{R}^2$ is along the straight line connecting them. To do this, we minimize the arc length of the curve connecting two points $(x_1, y_1)$ and $(x_2, y_2)$, namely:

$$\min_{y(x_1) = y_1,\; y(x_2) = y_2} \int_{x_1}^{x_2} \sqrt{1 + (y'(x))^2} \, dx$$

Similar to the previous example, this problem is a minimization problem of a functional, which contains integral expressions of some functions and their derivatives.

To solve this, we write the Euler-Lagrange equation by setting the gradient of the functional equal to zero. This leads us to the minimizing function y(x)=mx+b, where m and b are, respectively, the slope and the y-intercept of the straight line connecting the two given points. The details of this example are in the PDF file on calculus of variations on this book’s GitHub page.

Other introductory examples to the calculus of variations

Other introductory examples to calculus of variations, which we can solve via minimizing an appropriate energy functional (via a variational principle) include the minimal surface problem and the isoperimetric problem.

Optimization on Networks

I wanted to start with optimization on networks before the simplex method for linear optimization because more people are used to thinking in terms of algebraic forms (equations and functions) than in terms of graph or network structures, despite the abundance of network structures in nature and operations research applications. We need to become very comfortable with graph models. Optimization problems on network structures tend to be combinatorial in nature, $O(n!)$, which is no bueno, so we need algorithms that somehow circumvent this and efficiently sift through the search space. (Remember, the order of a problem is usually a worst-case scenario, and in worst cases we suffice ourselves with approximate solutions.)

We discuss typical network problems, which happen to capture a wide variety of real-life applications. The traveling salesman problem is one of the oldest and most famous, so we start there. We live in an age where we have open source software packages and cloud computing resources that include powerful algorithms for solving all the problems mentioned in this chapter, so in this section we focus on understanding the type of the network problem and its applications instead of the algorithms devised to solve them.

Traveling Salesman Problem

This is a famous problem in operations research that fits into many real-world situations. A salesman is required to visit a number of cities during a trip. Given the distances between the cities, in what order should he travel so as to visit every city precisely once and return home, with the objective of keeping the distance traveled at a minimum (Figure 10-1)?

Figure 10-1. The traveling salesman problem (image source)

Applications are numerous: a delivery truck leaving a warehouse must deliver packages to every address in the least costly way (measured by time or distance); or we must find the most efficient hole sequence to drill on a printed circuit board when manufacturing electronic chips.

We represent the traveling salesman problem as an optimization problem on a graph: the cities are the nodes, there are edges between each pair of cities (making the graph complete), and each edge has a weight (or attribute or feature) representing the distance between the two cities. This graph has many paths passing through all the cities only once and returning to the one we started with (a Hamiltonian circuit), but we want the one with the smallest sum of distances.

Let’s think of the complexity of this problem. The total number of different Hamiltonian circuits in a complete graph of n nodes is (n−1)!/2. Starting at any node, we have n-1 edges to choose from to pick the next city to visit, then n−2 options from the second city, n−3 from the third city, and so on. These choices are independent, so we have a total of (n−1)! choices. We must divide by 2 to account for symmetry, in the sense that we can traverse the same Hamiltonian circuit forward or backward and still get the exact same total distance traveled. This counting problem is a circular permutation with symmetry. An exhaustive solution of the traveling salesman would list all (n−1)!/2 Hamiltonian circuits, adding up the distance traveled in each, then choosing the one with the shortest distance. Even for a reasonable value of n, it is too expensive; for example, to visit all 50 US state capitals (say we want to minimize total trip cost), we would need to try ( 50 - 1 ) ! / 2 = 3 . 04 × 10 62 options! We do not have an efficient algorithm for problems of arbitrary size. Heuristic methods can provide excellent approximate solutions. Moreover, great algorithms based on an approach called branch and cut have solved this problem to optimality for very large numbers of cities.

Minimum Spanning Tree

I put the minimum spanning tree problem right after the traveling salesman because sometimes people confuse the two. This is a good place to clear the confusion. Here, we have a fully connected network with positive weights associated with each edge, which again can represent distance, time, capacity, or cost of connecting infrastructure such as water, electric, or phone lines. Similar to the traveling salesman problem, we want to find the set of edges that includes all the nodes of the graph and minimizes the total weight. The requirement here that is different than the traveling salesman is that we want to make sure we choose the set of edges in a way that provides a path between any two pairs of nodes, meaning we can reach any node in the graph from any other node. In the traveling salesman problem, we need to visit every city only once, then return to the starting city, which means that each node cannot get more than two edges (no such requirement for a spanning tree). The fact that we return to the last city in the traveling salesman problem means that we have an extra circuit closing edge that we do not need for spanning trees. If we remove that last edge of a traveling salesman solution, then we definitely get a spanning tree; however, there is no guarantee that it is the one with minimal cost. Figure 10-2 shows minimum spanning tree and traveling salesman solutions of the same graph.

Figure 10-2. Minimum spanning tree and traveling salesman solutions of the same graph

Note that for any network, if we have n nodes then we only need n-1 edges so that we have a path between every two nodes, so we should never use more than n-1 edges for a minimal spanning tree because that would increase our cost. We need to choose the set of edges that minimizes the cost.
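
One classical way to build a minimum spanning tree is Kruskal's algorithm: sort the edges by weight and greedily keep any edge that does not close a cycle. A minimal sketch with a made-up example network:

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: sort edges by weight, greedily add an edge
    whenever it connects two previously separate components.
    `edges` is a list of (weight, u, v); nodes are 0..n-1."""
    parent = list(range(n))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

# A small made-up network: 4 nodes, weighted edges
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
mst, total = kruskal_mst(4, edges)
print(mst, total)   # the tree uses exactly n - 1 = 3 edges
```

Prim's algorithm is the standard alternative; both are greedy, and both provably return a tree with exactly n − 1 edges, matching the observation above.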

We have already mentioned some applications, such as designing telecommunication networks, routing and transportation networks, electric networks, and infrastructure networks (pipelines). These networks are expensive to develop, and designing them optimally saves millions of dollars.

Shortest Path

The simplest version of the shortest path problem is that we have two nodes on a graph and we want to connect them with a set of edges so that the total sum of the edge weights (distance, time) is minimal. This is different than the traveling salesman and the minimal spanning tree problems because we don’t care about covering all the nodes of the graph. All we care about is getting ourselves from the origin to the destination in the least costly way.
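
For nonnegative edge weights, Dijkstra's algorithm solves this version efficiently. A minimal sketch, with a made-up directed graph:

```python
import heapq

def dijkstra(graph, source, target):
    """Dijkstra's algorithm for nonnegative edge weights.
    `graph` maps each node to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue            # stale queue entry, skip it
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")         # target unreachable

graph = {
    "a": [("b", 7), ("c", 3)],
    "c": [("b", 2), ("d", 8)],
    "b": [("d", 1)],
}
print(dijkstra(graph, "a", "d"))  # → 6, via a-c-b-d
```

Dropping the early `u == target` return turns this into the single-source version mentioned below, which computes shortest paths from the origin to all other nodes.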

One obvious application is travel from one destination to another with minimal distance, cost, time, etc. Other applications that are not immediately obvious but nevertheless are ultra-important are activity networks. Instead of an origin and a destination, we might have a beginning of a project and an end. Each node represents an activity, and each edge weight represents the cost or the time incurred if activity i is adjacent to activity j (if we have a directed graph, then it would be the cost or time incurred if activity i happens after activity j). The goal is to choose the sequence of activities that minimizes the total cost.

Other versions of the shortest path problem include finding the shortest path from an origin to all other nodes, or finding the shortest paths between all pairs of nodes.

Many vehicle routing algorithms and network design algorithms include shortest path algorithms as subroutines.

We can also formulate the shortest path problem as a linear optimization problem and use the methods available for linear optimization.

Max-Flow Min-Cut

Here we also have an origin and a destination, each directed edge has a capacity of some sort (max number of vehicles allowed on a route, max number of commodities shipped on a route, max amount of material or natural resource, such as oil or water, that a pipeline can handle), and we want to find the set of edges that maximizes the flow from the origin to the destination. Note that all edges point away from the origin and point toward the destination.

A very important theorem from graph theory plays a crucial role in determining the optimality (max flow) of a set of edges connecting the origin to the destination:

The max-flow min-cut theorem says the maximum flow from the origin to the destination through the directed network is equal to the minimal sum of weights of the edges required to cut any communication between the origin and the destination. That is, we can cut through the network to prevent communication between the origin and the destination in more than one way. The set of edges that cuts communication and has the least weight is the minimal cut set. The value of this minimal cut set is equal to the value of the maximum flow possible in the network. This result is pretty intuitive: what’s the most that we can send through the edges of the network? This is bounded from above by the capacities of the edges crucial for connecting the origin to the destination.
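
The theorem can be watched in action with a sketch of the Edmonds–Karp flavor of the Ford–Fulkerson method, which both computes the max flow and recovers a min cut once no augmenting path remains (the small capacity network is made up):

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting
    paths. `capacity[u][v]` is the edge capacity (dict of dicts)."""
    # residual capacities, including reverse edges initialized to 0
    nodes = set(capacity) | {v for u in capacity for v in capacity[u]}
    res = {u: {} for u in nodes}
    for u in capacity:
        for v, c in capacity[u].items():
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            # no augmenting path: nodes reachable from s define a min cut
            reachable = set(parent)
            cut = sum(capacity[u].get(v, 0)
                      for u in capacity for v in capacity[u]
                      if u in reachable and v not in reachable)
            return flow, cut
        # find the bottleneck along the path, then update residuals
        v, bottleneck = t, float("inf")
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, res[u][v])
            v = u
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck

capacity = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(capacity, "s", "t"))  # max flow equals min cut, as the theorem promises
```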

We can reformulate the max flow problem as a linear optimization problem, and of course, the min cut problem will be its dual, so of course they have the same solution! We will see this soon in this chapter.

Finally, suppose we have more than one origin and more than one destination, similar to a distribution network. Then we can still maximize the flow through the network by solving the exact same problem, except now we add a fictional super origin pointing to all the real origins, and another fictional super destination that all the real destinations point toward, with infinite capacities, then do business as usual, solving for the max flow on this new graph with two new fictional super nodes.

Max-Flow Min-Cost

This is similar to the max flow problem, except that now we have a cost associated with sending a flow through each edge proportional to the number of units of flow. The goal is obviously to minimize the cost while satisfying the supply from all the origins to all the destinations. We can formulate this problem as a linear optimization problem and solve it using a simplex method optimized for networks. Applications are ubiquitous and so important: all kinds of distribution networks, with supply nodes, trans-shipment nodes, demand nodes, supply chains (of goods, blood, nuclear materials, food), solid waste management networks, coordinating the types of products to produce or spend resources to satisfy the market, cash flow management, and assignment problems, such as assigning employees to tasks, time slots to tasks, or job applicants to available jobs.

The Assignment Problem

The assignment problem is also called the matching problem. The number of assignees should be the same as the number of tasks, each can be assigned only one task, and each task can be performed by only one assignee. There is a cost to assigning task i to assignee j. The objective is to choose the matching between tasks and assignees that minimizes the total cost. The graph of such a problem is a special type called a bipartite graph. Such a graph can be divided into two parts, where all the edges go from one node in the first part to one node in the second part. An assignment problem where all the weights are the same is a max flow problem on a bipartite graph. All we have to do is assign a fictional super origin and another fictional super destination and solve the problem the same way we solve the max flow problem in the upcoming section on linear optimization and duality. There are many efficient algorithms for these problems.
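
For tiny instances, the matching can be found by brute force over all n! assignments; a minimal sketch with made-up costs (for larger instances, SciPy's `linear_sum_assignment` implements an efficient Hungarian-style algorithm):

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force matching: try every permutation of tasks over
    assignees and keep the cheapest. Only viable for small n, but it
    makes the structure of the problem explicit."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):    # perm[i] = task given to assignee i
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# cost[i][j] = cost of assignee i performing task j (made-up numbers)
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
print(min_cost_assignment(cost))  # assignee 0 → task 1, 1 → task 0, 2 → task 2
```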

The Critical Path Method for Project Design

The critical path method (CPM) is an optimization method on a network representing all the involved activities in a project, the total budget, the total time constraint, which ones need to happen before others, how much time and cost each activity incurs, and which activities can happen simultaneously. Think, for example, of a house construction project from start to finish. The critical path method for time and cost trade-offs is a great tool to aid in designing a project that incorporates trade-offs between time and cost, and make sure the project meets its deadlines at a minimal total cost. Similar to the critical path method is the Program Evaluation Review Technique (PERT), a project management planning tool that computes the amount of time it will take to complete a project. Both methods provide three timelines: a shortest possible timeline, a longest possible timeline, and a most probable timeline.

The n-Queens Problem

Before moving on to linear optimization, the simplex method, and duality, we make a tiny detour and mention an interesting combinatorial problem that has puzzled mathematicians for 150 years, mainly because of its utter lack of structure: the n-queens problem, such as the one in Figure 10-3. Michael Simkin has finally (July 2021) answered the 150-year-old chess-based n-queens problem. Here is an edited part of the abstract of his solution paper, titled “The Number of n-Queens Configurations”:

The n-queens problem is to determine the number of ways to place n mutually nonthreatening queens on an n × n [chess] board. We show that there exists a constant α = 1.942 ± 3 × 10^−3 such that [the number of ways to place the mutually nonthreatening queens on the board is] ((1 ± o(1)) ne^−α)^n. The constant α is characterized as the solution to a convex optimization problem in P([−1/2, 1/2]^2), the space of Borel probability measures on the square.

Figure 10-3. Eight queens in mutually nonthreatening positions on an 8 × 8 chessboard

This web page has an easy backtracking algorithm for solving the n-queens problem. Note that the solution by Simkin quantifies the total number of viable queen configurations, while algorithms only find one or some of these configurations.
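
A short backtracking counter of the kind such pages describe, placing one queen per row and tracking attacked columns and diagonals in sets:

```python
def count_n_queens(n):
    """Backtracking count of all ways to place n mutually
    nonthreatening queens, one per row; attacked columns and both
    diagonal directions are tracked in sets."""
    cols, diag1, diag2 = set(), set(), set()

    def place(row):
        if row == n:
            return 1
        total = 0
        for col in range(n):
            if col in cols or row - col in diag1 or row + col in diag2:
                continue        # square is attacked: skip it
            cols.add(col); diag1.add(row - col); diag2.add(row + col)
            total += place(row + 1)
            cols.remove(col); diag1.remove(row - col); diag2.remove(row + col)
        return total

    return place(0)

print([count_n_queens(n) for n in range(1, 9)])
# The classic 8 × 8 board admits 92 configurations
```

Counting by exhaustive search like this quickly becomes hopeless as n grows, which is exactly why an asymptotic formula such as Simkin's is remarkable.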

Linear Optimization

Any optimization problem in finite dimensions, whether linear or nonlinear, looks like:

min f(x) subject to g₁(x) ≤ 0, g₂(x) ≤ 0, …, gₘ(x) ≤ 0

A point x that happens to satisfy all the constraints is a feasible point. We have the following cases:

There is only one optimal solution

Think that the landscape of the objective function has only one lowest point.

There are multiple optimal solutions

In this case, the set of optimal solutions can be bounded or unbounded.

The optimal value goes to −∞

The landscape of the objective function goes downhill indefinitely, so no feasible point is optimal.

The feasible set is empty

We do not care about the objective function and its low values, since there are no points that satisfy all the constraints at the same time. The minimization problem has no solution.

The optimal value is finite but not attained

There is no optimizer, even when the feasible set is nonempty. For example, inf_{x > 0} 1/x is equal to zero, but there is no finite x such that 1/x = 0. This never happens for linear problems.

For an optimization problem to be linear, both the objective function f and all the constraints g must be linear functions. Linear optimization gets the lion’s share in operations research, since we can model many operations research problems as a minimization of a linear function with linear constraints that can either be equalities or inequalities.

The General Form and the Standard Form

Linearity is such a great thing, as it opens up all tools of linear algebra (vector and matrix computations). There are two forms of linear optimization problems that people usually work with:

The general form

This is convenient for developing the theory of linear programming. Here, there is no restriction on the signs of the decision variables (the entries of the vector x ):

min ( c · x ) subject to Ax ≥ b

The feasible set Ax ≥ b is a polyhedron, which we can think of as the intersection of a finite number of half-spaces with flat boundaries. This polyhedron can be bounded or unbounded. We will see examples shortly.

The standard form

This is convenient for computations and developing algorithms, like the simplex and interior point methods. The decision variables must be nonnegative, so we are only searching for optimizers in the first hyperoctant, the high-dimensional analog of the first quadrant, where all the coordinates are nonnegative. Moreover, the constraints must always be equalities, not inequalities, so we are on the boundary of the polyhedron, not in the interior. This is a linear optimization problem written in standard form:

min ( c · x ) subject to Ax = b, x ≥ 0

There is an easy way for us to intuitively understand a linear problem in standard form: synthesize the vector b from the columns of A in a way that minimizes the cost c · x.

We can easily go back and forth between the standard form and general form of a linear optimization problem. For example, we can introduce surplus and slack variables to convert a general linear optimization problem to standard form, but note that in the process of doing that, we end up with the same problem in different dimensions. When we introduce a variable to change an inequality into an equality, such as introducing s₁ to convert the inequality x₁ − 3x₂ ≥ 4 to the equality x₁ − 3x₂ − s₁ = 4, we increase the dimension (in this example from two to three). That is fine. It is actually one of the nice things about math that we can model an unlimited amount of dimensions even though we only live in a three-dimensional world.

Visualizing a Linear Optimization Problem in Two Dimensions

Let’s visualize the following two-dimensional problem, which is neither in general form nor in standard form (but we can easily convert it to either form, which we do not need to do for this simple problem, as we can extract the minimum by inspecting the graph):

min ( −x − y ) subject to x + 2y ≤ 3, 2x + y ≤ 3, x ≥ 0, y ≥ 0

Figure 10-4 shows the boundaries of all the constraints (straight lines) of this linear optimization problem, along with the resulting feasible set. The optimal value of the objective function –x – y is –2, attained at the point (1,1), which is one of the corners of the feasible set.

Figure 10-4. The feasible set; the optimal value −2 of −x − y is attained at the corner point (1,1)

If this was an unconstrained problem, then the infimum of −x − y would instead be −∞. Constraints make a huge difference. The fact that the optimal value is at one of the corners of the polygon (two-dimensional polyhedron) is not a coincidence. If we draw the straight line −x − y = c for some c that places part of the line inside the feasible set, then move in the direction of the negative of the gradient vector (recall that this is the direction of fastest descent), the line would move in the direction of the vector −∇(−x − y) = −(−1, −1) = (1, 1) (it is definitely a coincidence that the gradient vector has the same coordinates as the optimizing point, as these two are completely unrelated). As long as the line has parts of it inside the feasible set, we can keep pushing and making c smaller until we can’t push anymore, because if we did, we would exit the feasible set, become infeasible, and lose all our pushing work. This happens exactly when the whole line is outside the feasible set and barely hanging at the point (1,1), which is still in the feasible set.

We found our optimizer, the point that makes the value of –x – y smallest. We will get back to moving through the corners of feasible sets of linear problems soon, because that’s where the optimizers are.
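
We can exploit this corner property numerically for the small example above: intersect every pair of constraint boundary lines, keep the feasible intersection points, and evaluate the objective there. A sketch:

```python
from itertools import combinations

# Constraints of the example, each written as a·(x, y) <= b:
# x + 2y <= 3, 2x + y <= 3, -x <= 0, -y <= 0
constraints = [((1, 2), 3), ((2, 1), 3), ((-1, 0), 0), ((0, -1), 0)]

def corners(constraints, tol=1e-9):
    """Intersect every pair of constraint boundary lines (Cramer's rule)
    and keep the intersection points that satisfy all constraints."""
    pts = []
    for ((a1, b1), c1), ((a2, b2), c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue            # parallel boundaries, no corner
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + tol for (a, b), c in constraints):
            pts.append((x, y))
    return pts

# Evaluate the objective -x - y at every corner and take the smallest
best = min(corners(constraints), key=lambda p: -p[0] - p[1])
print(best, -best[0] - best[1])  # the corner (1, 1), with value -2
```

Enumerating every pair of boundaries is fine in two dimensions but explodes combinatorially in higher dimensions, which is exactly why the simplex method visits corners selectively instead.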

Convex to Linear

Even when the objective function is nonlinear, in many cases we may be lucky enough to reformulate the problem as a linear problem, then use linear optimization techniques to either obtain an exact solution or an approximation to the exact solution. One such case is when the objective function is convex. In optimization problems, after linearity, convexity is the next desirable thing, because we wouldn’t worry about getting stuck at a local minimum: a local minimum for a convex function is also a global minimum.

We can always approximate a convex (and differentiable) function by a piecewise linear convex function, as in Figure 10-5. Afterward, we can turn the optimization problem with a piecewise linear objective function into one with a linear objective function. This process, however, makes us lose differentiability in the first step (the function stops being smooth) and increase the dimension in the second step. Nothing is free.

Figure 10-5. Approximating a convex function by a piecewise linear function

A convex optimization problem has a convex objective function and a convex feasible set. Convex optimization is a whole field of its own.

Convex Function

A function f : ℝⁿ → ℝ is convex if and only if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), for all x, y ∈ ℝⁿ and 0 ≤ λ ≤ 1. This means that the segment connecting any two points on the graph of f lies above the graph of f.
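
The defining inequality is easy to spot-check numerically; a small sketch (the sampling range, trial count, and fixed seed are arbitrary choices):

```python
import random

def satisfies_convexity(f, trials=1000, lo=-10.0, hi=10.0, tol=1e-9):
    """Spot-check the defining inequality
    f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    at random points; a failure proves nonconvexity, while passing is
    only evidence of convexity, not a proof."""
    rng = random.Random(0)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        lam = rng.uniform(0, 1)
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + tol:
            return False
    return True

print(satisfies_convexity(lambda x: x * x))    # convex everywhere
print(satisfies_convexity(lambda x: x ** 3))   # not convex on [-10, 10]
```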

Helpful facts about convex functions:

  • A convex function cannot have a local minimum that fails to be a global minimum.

  • If the functions f₁, f₂, …, fₘ : ℝⁿ → ℝ are convex functions, then the function f(x) = maxᵢ fᵢ(x) is also convex. f may lose smoothness in this case, so optimization methods would not be able to use derivatives.

  • The function f(x) = max{m₁ · x + d₁, m₂ · x + d₂, …, mₙ · x + dₙ}, or more compactly, f(x) = max_{i=1,2,…,n} {mᵢ · x + dᵢ}, is piecewise linear, as in Figure 10-6. This is a convex function, since each mᵢ · x + dᵢ is convex (linear functions are convex and concave at the same time), and the maximum of convex functions is also convex.

Figure 10-6. The maximum of linear functions is piecewise linear and convex

Now we can reformulate optimization problems with piecewise linear convex objective functions as linear optimization problems:

min maxᵢ ( mᵢ · x + dᵢ ) subject to Ax ≥ b  ⟺  min z subject to Ax ≥ b, mᵢ · x + dᵢ ≤ z for all i

Note that we increased the dimension when we added a new decision variable z.

For example, the absolute value function f(x) = |x| = max{x, −x} is piecewise linear and convex. We can reformulate an optimization problem where the objective function includes absolute values of the decision variables as a linear optimization problem in two ways (here the cᵢ’s in the objective function must be nonnegative, otherwise the objective function might be nonconvex):

min Σᵢ cᵢ|xᵢ| subject to Ax ≥ b  ⟺  min Σᵢ cᵢzᵢ subject to Ax ≥ b, xᵢ ≤ zᵢ, −xᵢ ≤ zᵢ  ⟺  min Σᵢ cᵢ(xᵢ⁺ + xᵢ⁻) subject to Ax⁺ − Ax⁻ ≥ b, x⁺ ≥ 0, x⁻ ≥ 0

The Geometry of Linear Optimization

Let’s think of the geometry of a linear optimization problem in standard form, as this is the form that is most convenient for algorithms searching for the minimizer. Geometry is all about the involved shapes, lines, surfaces, points, edges, corners, etc. The problem in standard form:

min ( c · x ) subject to Ax = b, x ≥ 0

involves linear algebraic equations. We want to understand the geometric picture associated with these equations, along with the minimization process. Recall that linear is flat, and when flat things intersect with each other, they create hyperplanes, lines, and/or corners.

The linear constraints of the minimization problem define a polyhedron. We are highly interested in the corners of this polyhedron. But how do we know that the polyhedron has corners? What if it is only a half-space? As we mentioned before, if we change from general form to standard form, we jump up in dimension. Moreover, we enforce nonnegativity on the decision variables. Therefore, even if a polyhedron has no corners in the general form, it will always have corners in its higher-dimensional standard form: the standard form polyhedron gets situated in the first hyperoctant, and hence cannot possibly contain full lines. This is good. We have theorems that guarantee that, for a linear optimization problem, either the optimal value is −∞, or there exists a finite optimal value attained at one of the corners of the polyhedron. So we must focus our attention on these corners when searching for the optimizer. Since many polyhedra that are associated with real-world constraints have tens of thousands of corners, we need efficient ways to sift through them.

Intuitively, we can start with the coordinates of one corner of the polyhedron, then work our way to an optimal corner. But how do we find the coordinates of these corners? We use linear algebra methods. This is why it is convenient to express the constraints as a linear system Ax = b (with x ≥ 0). In linear optimization language, a corner is called a basic feasible solution. This algebraic name indicates that the corner’s coordinates satisfy all the constraints (feasible), and solve some linear equations associated with some basis (extracted from the system Ax = b, specifically, from m columns of A).

The interplay of algebra and geometry

Before discussing the simplex method, let’s associate the algebraic equations (or inequalities for problems that are in general form) of the constraints with geometric mental images:

Polyhedron

The constraints as a whole form a polyhedron. Algebraically, a polyhedron is the set of points x ∈ ℝⁿ satisfying a linear system Ax ≥ b, for some m × n matrix A and some vector b ∈ ℝᵐ.

Interior of a half-space

Here we consider only one inequality from the constraints, not the whole system Ax ≥ b, namely, aᵢ · x > bᵢ (the strict inequality part of one inequality constraint). This corresponds to all the points that lie on one side relative to one face of the polyhedron. The inequality is strict, so that we are in the interior of the half-space and not on the boundary.

Hyperplane

Here we consider only one equality constraint, aᵢ · x = bᵢ, or only the equality part of an inequality constraint. This is the boundary of the half-space aᵢ · x > bᵢ, or one face of the polyhedron.

Active constraints

When we plug the coordinates of a point x* into a constraint aᵢ · x ≥ bᵢ and we get equality, that is, aᵢ · x* = bᵢ, then the constraint is active at this point. Geometrically, this places x* at the boundary of the half-space, not in the interior.

Corner of the polyhedron

Geometrically, the right number of hyperplanes have to meet to form a corner. Algebraically, the right number of constraints is active at a corner point. This is a basic feasible solution, and we will go over it while discussing the simplex method.

Adjacent bases to find adjacent corners

These are two subsets of the columns of the matrix A that share all but one column. We use these to compute the coordinates of corners that are adjacent. In the simplex method, we need to geometrically move from one corner to an adjacent one, and these adjacent bases help us accomplish that in a systematic algebraic way. We will also go over this while discussing the simplex method.

Degenerate case

To visualize this, suppose that we have two lines intersecting in two dimensions, forming a corner. Now if a third line meets them at exactly the same point, or in other words, if more than two constraints are active at the same point in two dimensions, then we have a degenerate case. In n dimensions, the point x* has more than n hyperplanes passing through it, or more than n active constraints. Algebraically, this is what happens to our optimization algorithm as a consequence of this degeneracy: when we choose another set of linearly independent columns of A to solve for another basic feasible solution (corner), we might end up with the same one we got before, leading to cycling in our algorithm!

The Simplex Method

Our goal is to devise an algorithm that finds an optimal solution for a linear optimization problem in standard form:

min ( c · x ) subject to Ax = b, x ≥ 0

A is m × n with m linearly independent rows (so m ≤ n), b is m × 1, and c and x are n × 1. Without loss of generality, we assume that the m rows of A are linearly independent, which means that there is no redundancy in the constraints of the problem. This also guarantees the existence of at least one set of m linearly independent columns of A (rank(A) = m). We need these linearly independent columns, or basis, to initiate our search for the optimizer, at a certain corner of the polyhedron, and to move from one corner to the other using the simplex method.

The main idea of the simplex method

We start at a corner of the polyhedron (also called a basic feasible solution), move to another corner in a direction that is guaranteed to reduce the objective function, or the cost, until we either reach an optimal solution or discover that the problem is unbounded and the optimal cost is −∞ (these we know using certain optimality conditions, and they become the termination criteria for our algorithm). There is a chance of cycling in the case of degenerate problems, but we can avoid this by making smart choices (a systematic way of choosing) when there are ties in the process.

The simplex method hops around the corners of the polyhedron

The following is a linear optimization problem in three dimensions:

min ( −x₁ + 5x₂ − x₃ ) subject to x₁ ≤ 2, x₃ ≤ 3, 3x₂ + x₃ ≤ 6, x₁ + x₂ + x₃ ≤ 4, x₁, x₂, x₃ ≥ 0

Figure 10-7 shows the polyhedron corresponding to its seven linear constraints.

Figure 10-7. The simplex method moves from one corner of the polyhedron to the next until it finds the optimizing corner

Note that the problem is not in standard form. If we convert it to standard form, then we would gain four extra dimensions, corresponding to the four new variables that are required to convert the four inequality constraints to equality constraints. We cannot visualize the seven-dimensional polyhedron, but we do not need to. We can work with the simplex algorithm in seven dimensions and keep track of the important variables in three dimensions: (x₁, x₂, x₃). This way, we can trace the path of the simplex method as it moves from one corner of the polyhedron to the next one, reducing the value of the objective function −x₁ + 5x₂ − x₃ at each step, until it arrives at a corner with a minimal value.

You can skip the rest of this section and go straight to “Transportation and Assignment Problems”, unless you are interested in the details of the simplex method and its different implementations.

Steps of the simplex method

For a linear optimization problem in standard form, the simplex method progresses like this:

  1. Start at a corner of the polyhedron (basic feasible solution x*). How do we find the coordinates of this basic feasible solution? Choose m linearly independent columns A_B(1), ..., A_B(m) of A. Put them in a matrix B (basis matrix). Solve B x_B = b for x_B. If all the entries of x_B are nonnegative, we have found a basic feasible solution x*. Place the entries of x_B in the corresponding positions in x*, and make the rest zero. Alternatively, we can solve Ax = b for x, where the entries that correspond to the chosen columns are the unknown entries and the rest are zero. Therefore, a basic feasible solution x* has zero nonbasic coordinates and basic coordinates x_B = B⁻¹b.

    For example, if

    A = [ 1 1 2 1 0 0 0
          0 1 6 0 1 0 0
          1 0 0 0 0 1 0
          0 1 0 0 0 0 1 ]

    and b = (8, 12, 4, 6)ᵗ, then we can choose A₄, A₅, A₆, A₇ as a set of basic columns, giving x = (0,0,0,8,12,4,6)ᵗ as a basic feasible solution (the coordinates of one vertex of the polyhedron). We can alternatively choose A₃, A₅, A₆, A₇ as another set of basic columns, giving x = (0,0,4,0,−12,4,6)ᵗ as a basic solution that is not basic feasible, because it has a negative coordinate.

  2. Move from x* to another corner y* = x* + θ*d. We must find a direction d that keeps us in the polyhedron (feasible), increases only one nonbasic variable x_j from zero to a positive number, and keeps the other nonbasic variables at zero. At the same time, when we move from x* to y* = x* + θ*d, we must reduce the value of the objective function. That is, we want c · y* ≤ c · x*. When we increase the value of x_j from zero to a positive number, the difference in the objective function is c̄_j = c_j − c_B · B⁻¹A_j, so we must choose a coordinate j for which this quantity is negative. To make all of this work, the coordinates of d end up being d_j = 1 (because we introduced x_j), d_i = 0 if i ≠ j and i is nonbasic, and d_B = −B⁻¹A_j; and the value of θ* ends up being:

     θ* = min over all basic indices i with d_B(i) < 0 of ( −x_B(i) / d_B(i) ) := −x_B(l) / d_B(l)
  3. Now column A_B(l) exits the basis B, and column A_j replaces it.

  4. Repeat this process until we either reach a finite optimal solution (when no A_j from all the available columns of A gives us a negative c̄_j), or discover that the problem is unbounded and the optimal cost is −∞. This happens when we have d ≥ 0, so y = x + θd ≥ 0, making it feasible no matter how large θ gets; thus pushing θ to ∞ will keep reducing the cost c · y = c · x + θ(c_j − c_B · B⁻¹A_j) all the way to −∞.
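The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production solver: the function name `simplex`, the tolerances, and the tie-breaking are our own choices; it assumes the problem is already in standard form with a known starting basis, recomputes B⁻¹ from scratch at each iteration (the revised simplex method below avoids exactly this), and uses Bland's rule only for the entering column.

```python
import numpy as np

def simplex(A, b, c, basis):
    """Minimal simplex method for min c·x s.t. Ax = b, x >= 0.
    `basis` holds m column indices of A forming an initial basic
    feasible solution. Returns (x, cost), or raises if unbounded."""
    m, n = A.shape
    basis = list(basis)
    while True:
        B_inv = np.linalg.inv(A[:, basis])
        x = np.zeros(n)
        x[basis] = B_inv @ b                      # basic feasible solution
        cB = c[basis]
        # reduced costs c̄_j = c_j − c_B · B⁻¹A_j for nonbasic j
        reduced = {j: c[j] - cB @ (B_inv @ A[:, j])
                   for j in range(n) if j not in basis}
        entering = [j for j, rc in reduced.items() if rc < -1e-12]
        if not entering:                          # no negative c̄_j: optimal
            return x, c @ x
        j = min(entering)                         # Bland's rule: smallest index j
        dB = -B_inv @ A[:, j]                     # basic part of direction d
        if np.all(dB >= -1e-12):                  # d >= 0: unbounded problem
            raise ValueError("problem is unbounded, optimal cost is -inf")
        # ratio test: θ* = min over i with d_B(i) < 0 of −x_B(i)/d_B(i)
        ratios = [(-x[basis[i]] / dB[i], i) for i in range(m) if dB[i] < -1e-12]
        theta, l = min(ratios)
        basis[l] = j                              # A_j enters, A_B(l) exits
```

On the example above (the A, b from the basic feasible solution example, with the objective −x₁ + 5x₂ − x₃ and the slack columns as the starting basis), this sketch terminates at the corner x = (4, 0, 2, 0, 0, 0, 6) with cost −6.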

Notes on the simplex method

Some things to remember for the simplex method:

  • The move from x* to y* gives the two termination criteria for the simplex algorithm: no negative reduced cost c̄_j, or a feasible cost-reducing direction d all of whose coordinates are nonnegative.

  • If the feasible set is nonempty and every basic feasible solution is nondegenerate, then the simplex method is guaranteed to terminate after finitely many iterations, with either a finite optimal solution or −∞ optimal cost.

  • Suppose some of the basic feasible solutions are degenerate (some of the basic variables are also zero) and we end up at one of them. In this case, there is a chance that when we change the basis by introducing A_j and making A_B(l) exit, we stay at the same corner y = x + 0·d (this happens when x_B(l) = 0, so θ* = −x_B(l)/d_B(l) = 0). In this case, choose a new A_j until you actually move from x to y = x + θ*d, θ* > 0. One really bad thing that could happen here is that after we stop at x and keep changing basis (stalling for a little while at x) until we find one that actually moves us away from x to y = x + θ*d in a cost-reducing direction, we might end up with the same basis we started the algorithm with! This will lead to cycling, and the algorithm may loop indefinitely. Cycling can be avoided by making smart choices for which columns of A enter and exit the basis: a systematic way of choosing A_j and later B(l) in θ* when there are ties in the process.

  • When there are ties in the process (we have more than one cost-reducing option A_j that gives c̄_j < 0, and/or more than one minimizing index B(l) for θ*), we can devise rules to choose the entering A_j and/or the exiting A_B(l) at a step with such a tie. The rules we decide to follow when there are such ties are called pivoting rules.

  • A very simple and computationally inexpensive pivoting rule is Bland’s rule: choose the A_j with the smallest index j for which c̄_j < 0 to enter the basis, and choose the A_B(l) with the smallest eligible index B(l) to exit the basis. This smallest subscript pivoting rule helps us avoid cycling. There are other pivoting rules as well.

  • If n – m = 2 (so A has only two more columns than rows), then the simplex method will not cycle no matter which pivoting rule is used.

  • For problems that did not originate from a general form problem, especially those with a large number of variables, it might not always be obvious how to choose the initial basis B and associated basic feasible solution x (because it would not be clear which m columns of A are linearly independent). In this case, we introduce artificial variables and solve an auxiliary linear programming problem to determine whether the original problem is infeasible and hence there is no solution; or, if the problem is feasible, drive the artificial variables out of the basis and obtain an initial basis and an associated basic feasible solution for our original problem. This process is called Phase I of the simplex method. The rest of the simplex method is called Phase II.

  • The big-M method combines Phase I and Phase II of the simplex method. Here we use the simplex method to solve:

    min_{Ax + y = b, x ≥ 0, y ≥ 0} ( c · x + M(y₁ + y₂ + ⋯ + y_m) )

    For a sufficiently large choice of M, if the original problem is feasible and its optimal cost is finite, all of the artificial variables y₁, y₂, …, y_m are eventually driven to zero, which takes us back to our original problem. We can treat M as an undetermined parameter and let the reduced costs be functions of M, and treat M as a very large number when determining whether a reduced cost is negative.
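As a concrete sketch of the big-M idea: append one artificial variable per equality constraint with cost M, solve the augmented problem, and check that the artificials are driven to zero. The specific numbers below are a made-up instance, and SciPy's general-purpose `linprog` solver is used only for convenience rather than a hand-rolled simplex:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical primal: min x1 + x2  s.t.  x1 + 2*x2 = 4,  3*x1 + x2 = 7,  x >= 0
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 7.0])
c = np.array([1.0, 1.0])
M = 1e4  # a "sufficiently large" penalty on the artificial variables

# Big-M augmented problem: min c·x + M(y1 + y2)  s.t.  Ax + y = b,  x, y >= 0
A_aug = np.hstack([A, np.eye(2)])         # artificial columns form an identity
c_aug = np.concatenate([c, [M, M]])
res = linprog(c_aug, A_eq=A_aug, b_eq=b, bounds=(0, None), method="highs")

x, y = res.x[:2], res.x[2:]
# y ends up at (0, 0), so the big-M optimum solves the original problem
```

Note the identity block of artificial columns also hands us an obvious starting basis, which is the whole point of Phase I.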

The revised simplex method

The revised simplex method is a computationally less expensive implementation of the simplex method. It provides a cheaper way to compute B̄⁻¹ by exploiting the relationship between the old basis B and the new basis B̄: they differ in only one column (the two vertices involved are adjacent). So we can obtain the new B̄⁻¹ from the previous B⁻¹.

The following is a typical iteration of the revised simplex algorithm. This will also help reinforce the simplex method steps from the previous section. Note that for simplicity, we suppress the vector notation for the x’s, y’s, b’s, and d’s:

  1. Start with a B consisting of m basic columns from A and the associated basic feasible solution x, with x_B = B⁻¹b and x_i = 0 otherwise.

  2. Compute B⁻¹ (it is B⁻¹, not B, that appears in the simplex method computations).

  3. For j nonbasic, compute the reduced costs c̄_j = c_j − c_B · B⁻¹A_j (this will give you n − m reduced costs).

  4. If all the c̄_j are nonnegative, the current basic feasible solution x is optimal, and the algorithm terminates with x as the optimizer and c · x as the optimal cost (there is no A_j that could enter the basis and reduce the cost even more).

  5. Else, choose a j for which c̄_j < 0 (Bland’s pivoting rule tells us to choose the smallest such j). Note that this makes A_j enter the basis.

  6. Compute a feasible direction d: d_j = 1, d_B = −B⁻¹A_j, and d_i = 0 otherwise.

    • If all the components of d_B are nonnegative, the algorithm terminates with optimal cost −∞ and no optimizer.

    • Else, choose the components of d_B that are negative, and let:

      θ* = min over all basic indices i with d_B(i) < 0 of ( −x_B(i) / d_B(i) ) := −x_B(l) / d_B(l)

      This step computes θ* and assigns B(l) as the index of the exiting column.

  7. Compute the new basic feasible solution y = x + θ*d (this new basic feasible solution corresponds to the new basis B̄, which has A_j replace A_B(l) in B).

  8. This step computes the new B̄⁻¹ for the next iteration without forming the new basis B̄ and then inverting it: form the m × (m+1) augmented matrix (B⁻¹ | B⁻¹A_j). Perform row operations using the lth row (add to each row a multiple of the lth row) to make the last column the unit vector e_l, which is zero everywhere except for the 1 in the lth coordinate. The first m columns of the result are your new B̄⁻¹.

    Justification

    Let u = B⁻¹A_j and note that B⁻¹B̄ = (e₁ e₂ ⋯ u ⋯ e_m), where e_i is the unit column vector with 1 in the ith entry and zero everywhere else, and u sits in the lth column. The matrix becomes the identity matrix if we perform row operations using the lth row and transform u into e_l. All row operations can be bundled together in an invertible matrix Q applied from the left: Q B⁻¹B̄ = I. Now right-multiply by B̄⁻¹ to get Q B⁻¹ = B̄⁻¹. This means that to obtain B̄⁻¹, perform on B⁻¹ the same row operations that will transform u to e_l.
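The justification above translates directly into code. Below is a sketch of the update (the function name `update_inverse` and the example matrices are our own): form the augmented matrix (B⁻¹ | u) with u = B⁻¹A_j, then pivot on the lth entry of u.

```python
import numpy as np

def update_inverse(B_inv, A_j, l):
    """Given B⁻¹, the entering column A_j, and the index l of the
    exiting basic column, return the inverse of the new basis B̄
    (B with its lth column replaced by A_j) via row operations."""
    m = B_inv.shape[0]
    u = B_inv @ A_j
    T = np.hstack([B_inv, u.reshape(-1, 1)])  # augmented matrix (B⁻¹ | u)
    T[l, :] /= T[l, m]                        # make the pivot element 1
    for i in range(m):
        if i != l:
            T[i, :] -= T[i, m] * T[l, :]      # zero the rest of the last column
    return T[:, :m]                           # first m columns are B̄⁻¹
```

A quick check against direct inversion confirms the two agree, which is exactly the Q B⁻¹ = B̄⁻¹ identity from the justification.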

Calculating B̄⁻¹ Using B⁻¹ in the Revised Simplex

This method does not start from the current m basic columns of A and invert; instead, it does row operations on the previously calculated B⁻¹, which could include roundoff errors. Doing this over many iterations will accumulate these errors, so it is better to compute B̄⁻¹ straight from the columns of A every now and then to avoid error accumulation.

The full tableau implementation of the simplex method

The full tableau implementation of the simplex method has the advantage of only storing and updating one matrix. Here, instead of maintaining and updating B⁻¹, maintain and update the m × (n+1) matrix (x_B | B⁻¹A) = (B⁻¹b | B⁻¹A). The column u = B⁻¹A_j corresponding to the variable entering the basis is called the pivot column. If the lth basic variable exits the basis, then the lth row is called the pivot row. The element belonging to both the pivot row and the pivot column is called the pivot element. Now add a zeroth row on top of your tableau that keeps track of the negative of the current cost −c · x = −c_B · x_B = −c_B · B⁻¹b and the reduced costs c − c_B · B⁻¹A. So the tableau looks like:

−c_B · B⁻¹b | c − c_B · B⁻¹A
B⁻¹b        | B⁻¹A

or more expanded:

−c · x  | c̄₁ ⋯ c̄_n
x_B(1)  |
  ⋮     | B⁻¹A₁ ⋯ B⁻¹A_n
x_B(m)  |
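To make the layout concrete, here is a small NumPy helper (the function name `full_tableau` is our own) that assembles the tableau from A, b, c, and a list of basic column indices:

```python
import numpy as np

def full_tableau(A, b, c, basis):
    """Assemble the (m+1) x (n+1) simplex tableau for the given basic
    column indices: the zeroth row holds −c_B·B⁻¹b and the reduced
    costs c − c_B·B⁻¹A; the remaining rows hold (B⁻¹b | B⁻¹A)."""
    B_inv = np.linalg.inv(A[:, basis])
    cB = c[basis]
    top = np.concatenate(([-cB @ B_inv @ b], c - cB @ B_inv @ A))
    body = np.hstack([(B_inv @ b).reshape(-1, 1), B_inv @ A])
    return np.vstack([top, body])
```

For a basis of slack columns (so c_B = 0 and B = I), the zeroth row is simply (0 | c) and the body is (b | A), which makes the tableau easy to sanity-check by hand.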

One helpful thing that is nice to master: extracting B⁻¹ and B easily from a given simplex tableau (like in the movie The Matrix).

The most efficient implementation of the simplex method is the revised simplex (memory usage is O(m²), worst-case time for a single iteration is O(mn), best-case time for a single iteration is O(m²), while all of the above measures for the full tableau method are O(mn)), but everything depends on how sparse the matrices are.

Transportation and Assignment Problems

Transportation and assignment problems are linear optimization problems that we can formulate as min cost network flow problems.

Transportation problem

Allocate products to warehouses, minimize costs.

Assignment problem

Allocate assignees to tasks, where the number of assignees equals the number of tasks and each assignee performs one task. There is a cost when assignee i performs task j. The objective is to select an assignment that minimizes the total cost. One example is assigning Uber drivers to customers, or machines to tasks.

We exploit the fact that the involved matrices are sparse, so we don’t have to do a full implementation of the simplex algorithm, only a special streamlined version that solves both the assignment and transportation problems. This is related to the network simplex method, which solves any minimum cost flow problem, including both transportation and assignment problems. The transportation and assignment problems are special cases of the minimum cost flow problem. The Hungarian method is specific to the assignment problem; since it is specialized, it is more efficient. These special-purpose algorithms are included in some linear programming software packages.
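For instance, SciPy ships a special-purpose solver for the assignment problem, `scipy.optimize.linear_sum_assignment`. The 3 × 3 cost matrix below is a made-up example:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of assignee i (e.g., a driver) performing task j
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
total = cost[rows, cols].sum()            # minimal total cost
```

Here the optimal assignment sends assignee 0 to task 1, assignee 1 to task 0, and assignee 2 to task 2, for a total cost of 5; a general linear programming solver would reach the same answer more slowly.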

Duality, Lagrange Relaxation, Shadow Prices, Max-Min, Min-Max, and All That

We hinted at and danced around the idea of duality earlier in this chapter when discussing finite dimensional constrained optimization and relaxing the constraints using Lagrange multipliers. Duality is really helpful when our constrained problems are linear, or quadratic with linear constraints. It gives us the option to solve either the optimization problem at hand (the primal) or another related problem (its dual), whichever happens to be easier or less expensive, and get the same solution. Usually, having more decision variables (dimensions of the problem) is not as strenuous for an algorithm as having more constraints. Since the dual problem flips the roles of decision variables and constraints, solving it instead of the primal problem makes more sense when we have too many constraints (another option here is using the dual simplex method to solve the primal problem, which we will talk about soon). Another way the dual problem helps is that it sometimes provides shortcuts to the solution of the primal problem. A feasible vector x for the primal problem will end up being the optimizer if there happens to be a feasible vector p for the dual problem such that c · x = p · b.

When learning about duality in the next few paragraphs, think of it in the same way you see Figure 10-8: something is happening in the primal realm, some form of related shadow or echo is happening in the dual realm (some alternate universe), and the two meet at the optimizer, like a gate where the two universes touch.

Figure 10-8. Duality, shadow problems, and shadow prices

So if we are maximizing in one universe, we are minimizing in the other; if we are doing something with the constraints in one universe, we do something to the decision variables in the other, and vice versa.

Motivation for duality: Lagrange multipliers

For any optimization problem (linear or nonlinear):

min_{x ∈ feasible set} f(x)

Instead of finding the minimizer x* by setting the gradient equal to zero, look for an upper bound on f(x*) (easy: plug any element of the feasible set into f(x)) and for a lower bound on f(x*) (this is a harder inequality and usually requires clever ideas). Now we would have lower bound ≤ f(x*) ≤ upper bound, so we tighten these bounds to get closer to the actual solution f(x*). We tighten the bounds by minimizing the upper bounds (this brings us back to the original minimization problem) and maximizing the lower bounds (this establishes the dual problem).

Now for a linear minimization problem in any form (standard form, general form, or neither):

min_{linear constraints on x} c · x

What is the clever idea that gives us lower bounds for f(x) = c · x? We look for lower bounds on f(x) = c · x made up of a linear combination of the problem constraints. So we multiply each of our constraints by a multiplier p_i (a Lagrange multiplier), choosing its sign so that the multiplied constraint inequality still points in the ≥ direction. How so? Well, the linear constraints are linear combinations of the entries of x, the objective function c · x is also a linear combination of the entries of x, and a linear combination of linear combinations is still a linear combination, so we can totally pick a linear combination of the constraints that we can compare to c · x.

Namely, if we have m linear constraints, we need:

p₁b₁ + p₂b₂ + ⋯ + p_m b_m ≤ c · x

The sign of a multiplier p_i would be free if the corresponding constraint is an equality. Once we have these lower bounds, we tighten them by maximizing over the p_i, which gives us the dual problem.

Finding the dual linear optimization problem from the primal linear optimization problem

It is important to get the sizes of the inputs to a linear optimization problem right. The inputs are: A, which is m × n , c , which is n × 1 , and b , which is m × 1 . The decision variables in the primal problem are in the vector x , which is n × 1 . The decision variables in the dual problem are in the vector p , which is m × 1 .

In general, if A appears in the primal problem, then A t appears in the dual problem. So in the primal problem, we have the dot product of the rows of A and x . In the dual problem, we have the dot product of the columns of A and p . If the linear optimization problem is in any form, it’s easy to write its dual following this process:

  • If the primal is a minimization, then the dual is a maximization and vice versa.

  • The primal cost function is c . x , and the dual cost function is p . b .

  • In a minimization primal problem, we separate the constraints into two types:

    Type one

    Constraints telling us about the sign of the decision variable, for example:

    • x₃ ≥ 0, then in the dual this will correspond to A₃ · p ≤ c₃, where A₃ is the third column of A and c₃ is the third entry of c.

    • x₁₂ ≤ 0, then in the dual this will correspond to A₁₂ · p ≥ c₁₂, where A₁₂ is the 12th column of A and c₁₂ is the 12th entry of c.

    • x₅ is free, meaning it has no specified sign. Then in the dual this will correspond to A₅ · p = c₅, where A₅ is the fifth column of A and c₅ is the fifth entry of c.

    Type two

    Constraints of the form aᵢ · x ≥ bᵢ, aᵢ · x ≤ bᵢ, or aᵢ · x = bᵢ, where aᵢ is the ith row of A. In the dual these will correspond to constraints on the sign of pᵢ, for example:

    • a₂ · x ≥ b₂, then in the dual this will correspond to p₂ ≥ 0.

    • a₇ · x ≤ b₇, then in the dual this will correspond to p₇ ≤ 0.

    • a₈ · x = b₈, then the sign of p₈ is free.

In particular, if the linear optimization problem is in standard form:

min_{Ax = b, x ≥ 0} c · x

then its dual is:

max_{p free, Aᵀp ≤ c} p · b

If the linear optimization problem is in general form:

min_{Ax ≥ b, x free} c · x

then its dual is:

max_{p ≥ 0, Aᵀp = c} p · b

How do we solve the dual problem? With the simplex method; however, since the dual is a maximization, we now move through basic feasible solutions that increase the objective rather than decrease it.

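As a numerical sanity check of this primal-dual relationship (the specific A, b, c below are a made-up standard-form instance, and SciPy's `linprog` is used only as a convenient solver): solve the primal min c·x over Ax = b, x ≥ 0, solve the dual max p·b over Aᵀp ≤ c with p free, and compare the optimal values.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1, 2, 1, 0, 0, 0],
              [0.0, 1, 6, 0, 1, 0, 0],
              [1.0, 0, 0, 0, 0, 1, 0],
              [0.0, 1, 0, 0, 0, 0, 1]])
b = np.array([8.0, 12, 4, 6])
c = np.array([-1.0, 5, -1, 0, 0, 0, 0])

# Primal (standard form): min c·x  s.t.  Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")

# Dual: max p·b  s.t.  A^T p <= c, p free  (linprog minimizes, so negate b)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None), method="highs")

# Strong duality: the primal optimal cost equals the dual optimal value
```

Note how the roles flip in the two calls: the primal has equality constraints and sign-constrained variables, while the dual has inequality constraints and free variables.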
Derivation for the dual of a linear optimization problem in standard form

There is another way to think about deriving the dual problem, but for this one the linear problem has to be in its standard form. Here’s the idea of it: we relax the constraint A x = b but introduce Lagrange multipliers p (pay a penalty p when the constraint is violated). So:

min_{Ax = b, x ≥ 0} c · x

becomes:

min_{x ≥ 0} ( c · x + p · (b − Ax) ) = g(p)

Now prove that g ( p ) is a lower bound for the original c . x * (this is the weak duality theorem), then maximize over p. The dual problem appears in the process.

The strong duality theorem says that the min of the primal problem and the max of the dual problem are equal. Note that if the primal problem is unbounded, then the dual problem is infeasible; and if the dual problem is unbounded, then the primal problem is infeasible.

Farkas’ lemma is at the core of duality theory and has many economic and financial applications.

Dual simplex method

The dual simplex method solves the primal problem (not the dual problem) using duality theory. The main difference between the simplex method and the dual simplex method is that the regular simplex method starts with a basic feasible solution that is not optimal and moves toward optimality, while the dual simplex method starts with an infeasible solution that is optimal and works toward feasibility. The dual simplex method is like a mirror image of the simplex method.

First, note that when we solve the primal problem using the simplex method, we obtain the optimal cost for the dual problem for free (equal to the primal optimal cost), but also, we can read off the solution (optimizer) to the dual problem from the final tableau for the primal problem. An optimal dual variable is nonzero only if its associated constraint in the primal is binding. This should be intuitively clear, since the optimal dual variables are the shadow prices (Lagrange multipliers) associated with the constraints. We can interpret these shadow prices as values assigned to the scarce resources (binding constraints), so that the value of these resources equals the value of the primal objective function. The optimal dual variables satisfy the optimality conditions of the simplex method. In the final tableau of the simplex method, the reduced costs of the basic variables must be zero. The optimal dual variables must be the shadow prices associated with an optimal solution.

Another way is to think of the dual simplex method as a disguised simplex method solving the dual problem. However, it does so without explicitly writing out the dual problem and applying the simplex method to it.

Moreover, the simplex method produces a sequence of primal basic feasible solutions (corners of the polyhedron); as soon as it finds one that is also dual feasible, the method terminates. On the other hand, the dual simplex method produces a sequence of dual basic feasible solutions; as soon as it finds one that is also primal feasible, the method terminates.

Example: Networks, linear optimization, and duality

Consider the network in Figure 10-9. The numbers indicate the edge capacity, which is the maximum amount of flow that each edge can handle. The max flow problem is to send the maximum flow from the origin node to the destination node. Intuitively, the maximum flow through the network will be limited by the capacities that the edges can transmit. In fact, this observation underlies a dual problem: maximizing the flow through the network is equivalent to minimizing the total capacity of a set of edges that, if cut, disconnect the origin from the destination. This is the max-flow min-cut theorem.

Figure 10-9. Duality: maximum flow through the network is equal to the smallest cut capacity

Figure 10-9 shows the values of all the cuts (a cut is a set of edges whose joint removal makes it impossible to get from the origin to the destination) through the network, along with the cut of minimal total edge capacity, which is 16. By the max-flow min-cut theorem, the max flow that we can send through the network is 16: send y1 = 12 units through the edge with capacity 19, and y2 = 4 units through the edge with capacity 4. Of the 12 units, y3 = 1 unit will flow through the edge with capacity 1, y4 = 11 units will flow through the edge with capacity 11, and y5 = 1 + 4 = 5 units will flow through the bottom edge with capacity 6. All 16 units make their way to the destination through the last two edges connected to it, with y6 = 0 (no units need to flow through the vertical edge with capacity 7), y7 = 11 units flowing through the rightmost edge with capacity 12, and y8 = 5 units flowing through the rightmost edge with capacity 6. The solution to the max flow problem is then (y1, y2, y3, y4, y5, y6, y7, y8) = (12, 4, 1, 11, 5, 0, 11, 5).

To formulate this network problem as a linear optimization problem (which we just solved graphically using our knowledge of the value of the minimal cut, which is the solution of the dual problem), we need to add one more fictional edge with flow value y 9 that connects the destination to the origin, and assume that the flow that gets to the destination fictionally finds its way back to the origin. In other words, we close the circuit, and apply Kirchhoff’s current law, which says the sum of currents in a network of conductors meeting at a point is zero, or the flow into a node is equal to the flow out of it. The linear maximization problem now becomes:

$$\max_{\substack{Ay = 0 \\ |y_i| \le M_i}} y_9$$

where A (Figure 10-10) is the incidence matrix of our network, y = (y1, y2, y3, y4, y5, y6, y7, y8, y9)^t is the vector of signed flows that we can send through each edge and that we need to solve for (we allow the y values to be negative so that flow in one direction cancels flow in the other; we just found the solution, without the signs, by inspection using the minimal cut intuition), M_i is the max capacity of edge i in the network, and the condition Ay = 0 guarantees that the flow into a node is equal to the flow out of the node. Of course, in this case the network will have directed edges showing the direction in which the optimal flow goes through each edge.

Figure 10-10. Incidence matrix of the network in Figure 10-9

Now that we have a linear formulation of the max flow problem, we can easily write its dual (the minimum cut problem) using the methods we learned in this chapter, and solve either the primal or the dual. Note that all we need for this formulation are the incidence matrix of the network, the edge capacities, and Kirchhoff's condition that the flow into a node is equal to the flow out of the node.

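To see the formulation end to end, here is a sketch on a small hypothetical network (four nodes and five edges invented for illustration, not the network of Figure 10-9): we build the incidence matrix, impose Ay = 0 and the capacity bounds, and ask SciPy's `linprog` to maximize the flow on the fictional return edge.

```python
# Max flow as a linear program: maximize the flow on a fictional return edge,
# subject to conservation (A @ y = 0) and edge capacity bounds.
# Hypothetical 4-node network: source s, middle nodes a and b, destination t.
import numpy as np
from scipy.optimize import linprog

# Edges: s->a, s->b, a->b, a->t, b->t, and the fictional return edge t->s.
# Incidence matrix: rows are nodes (s, a, b, t); an entry is +1 if the edge
# leaves the node and -1 if it enters it.
A = np.array([
    [ 1,  1,  0,  0,  0, -1],   # s
    [-1,  0,  1,  1,  0,  0],   # a
    [ 0, -1, -1,  0,  1,  0],   # b
    [ 0,  0,  0, -1, -1,  1],   # t
])
capacities = [3, 2, 1, 2, 3, None]        # None: the return edge is uncapacitated
bounds = [(0, M) for M in capacities]

c = [0, 0, 0, 0, 0, -1]                   # maximize y6 by minimizing -y6
res = linprog(c, A_eq=A, b_eq=np.zeros(4), bounds=bounds, method="highs")

max_flow = -res.fun
print(max_flow)
```

The optimal value 5 matches the capacity of the cut consisting of the two edges leaving the source (3 + 2 = 5), as the max-flow min-cut theorem promises.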
Example: Two-person zero-sum games, linear optimization, and duality

Another relevant setting where duality and linear optimization are built in is the two-person zero-sum game from game theory. A gain for one player is a loss for the other (hint: duality). To articulate the problem mathematically, we need the payoff matrix for all the options in the game for each of player 1 and player 2. Each player wants to devise a strategy that maximizes their payoff given their options (no one said that the payoff matrices of games have to be fair). We need to solve for the optimal strategy for each player. If we set up the optimization problem for player 1, we do not start from scratch to get the optimization problem for the strategy of player 2: we just write its dual. The expected payoff of the game will be the same for both players, assuming that both of them act rationally and follow their optimal strategies.

Consider, for example, the payoff matrix in Figure 10-11. The game goes like this: player 1 chooses a row and player 2 chooses a column at the same time. Player 1 pays the number in the chosen row and column to player 2. Therefore, player 1 wants to minimize and player 2 wants to maximize. The players repeat the game many times.

What is the optimal strategy for each of player 1 and player 2, and what is the expected payoff of the game?

Figure 10-11. Payoff matrix

To find the optimal strategy, suppose player 1 chooses row 1 with probability x1 and row 2 with probability x2. Then x1 + x2 = 1, 0 ≤ x1 ≤ 1, and 0 ≤ x2 ≤ 1. Player 1 rationalizes that if they use an (x1, x2) mixed strategy, there would be another row in the payoff matrix corresponding to this new strategy (see Figure 10-11). Now player 1 knows that player 2 wants to choose the column that maximizes their payoff, so player 1 must choose the (x1, x2) that makes the worst payoff (the maximum of the third row) as small as possible. Therefore, player 1 must solve the min-max problem:

$$\min_{\substack{x_1 + x_2 = 1 \\ 0 \le x_1 \le 1,\; 0 \le x_2 \le 1}} \max \left\{ x_1 + 3x_2,\; -x_2,\; 4x_1 + 2x_2 \right\}$$

Recall that the maximum of linear functions is a convex piecewise linear function. We can easily convert such a min-max (of linear functions) problem into a linear minimization problem:

$$\min_{\substack{x_1 + 3x_2 \le z,\; -x_2 \le z,\; 4x_1 + 2x_2 \le z \\ x_1 + x_2 = 1,\; 0 \le x_1 \le 1,\; 0 \le x_2 \le 1}} z$$

Figure 10-12 shows the formulation of the dual of this problem, and Figure 10-13 shows that this is exactly the problem that player 2 is trying to solve.

Figure 10-12. The dual of player 1's problem
Figure 10-13. The dual of player 1's min-max problem is the same as player 2's max-min problem

Note that the constraints y1 ≤ 1, y2 ≤ 1, and y3 ≤ 1 are redundant, since all the y's are nonnegative and add up to 1. The same is true for the constraints x1 ≤ 1 and x2 ≤ 1. This happens a lot when formulating linear optimization problems.

Solving either the primal or the dual problem, we find each player's optimal strategy: player 1 must go with the first row x1 = 0.25 of the time and with the second row x2 = 0.75 of the time, for an expected payoff of 2.5, which means player 1 expects to lose no more than 2.5 with this strategy. Player 2 must go with the first column y1 = 0.5 of the time and with the third column y3 = 0.5 of the time (never with the second column, y2 = 0), for an expected payoff of 2.5, which means player 2 expects to gain no less than 2.5 with this strategy.

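The linear form of player 1's min-max problem can be handed to any generic LP solver. A sketch using SciPy's `linprog` (an implementation convenience, not the book's code), with decision variables (x1, x2, z):

```python
# Player 1's problem: minimize z subject to
#   x1 + 3*x2 <= z,  -x2 <= z,  4*x1 + 2*x2 <= z,  x1 + x2 = 1,  x >= 0.
import numpy as np
from scipy.optimize import linprog

# Variables are (x1, x2, z); move z to the left: (column payoff) - z <= 0.
c = [0, 0, 1]                          # minimize z
A_ub = [[1,  3, -1],                   # x1 + 3*x2 - z <= 0   (column 1)
        [0, -1, -1],                   #      -x2 - z <= 0    (column 2)
        [4,  2, -1]]                   # 4*x1 + 2*x2 - z <= 0 (column 3)
b_ub = [0, 0, 0]
A_eq = [[1, 1, 0]]                     # x1 + x2 = 1
b_eq = [1]
bounds = [(0, 1), (0, 1), (None, None)]   # z is a free variable

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
x1, x2, z = res.x
print(x1, x2, z)    # 0.25, 0.75, and the game value 2.5
```

The solver recovers the mixed strategy (0.25, 0.75) and the game value 2.5 computed above; solving the dual instead would recover player 2's strategy (0.5, 0, 0.5).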
Quadratic optimization with linear constraints, Lagrangian, min-max theorem, and duality

A nonlinear optimization problem that has a nice structure, appears in all kinds of applications, and has a lot to teach us on how things tie together is a quadratic problem with linear constraints:

$$\min_{Ax = b} \frac{1}{2} x^t S x$$

Here, S is a symmetric positive semidefinite matrix, which means that its eigenvalues are nonnegative. In high dimensions, this plays the role of keeping the objective function convex and bounded below, shaped like the bowl of the one-dimensional function f(x) = x².

For example, this is a two-dimensional quadratic optimization problem with one linear constraint:

$$\min_{a_1 x_1 + a_2 x_2 = b} \frac{1}{2} \left( s_1 x_1^2 + s_2 x_2^2 \right)$$

Here, S = diag(s1, s2), a diagonal matrix whose entries s1 and s2 are nonnegative, and A = (a1  a2). Inspecting this problem, we are searching for the point (x1, x2) on the straight line a1x1 + a2x2 = b that minimizes the quantity f(x) = ½(s1x1² + s2x2²). The level sets of the objective function, ½(s1x1² + s2x2²) = k, are concentric ellipses that cover the whole ℝ² plane. The winning ellipse (the one with the smallest level set value) is the one that is tangent to the straight line at the winning point (Figure 10-14). At this point, the gradient vector of the ellipse and the gradient vector of the constraint align, which is exactly what the Lagrange multiplier formulation gives us: to formulate the Lagrangian, relax the constraint, but pay a penalty in the objective function equal to the Lagrange multiplier p times the amount by which we relaxed the constraint, then minimize the resulting unconstrained problem:

$$\mathcal{L}(x; p) = f(x) + p \left( b - g(x) \right) = \frac{1}{2} \left( s_1 x_1^2 + s_2 x_2^2 \right) + p \left( b - a_1 x_1 - a_2 x_2 \right)$$

When we minimize the Lagrangian, we set its gradient equal to zero, and that leads to ∇f(x) = p ∇g(x). This says that the gradient vector of the objective function is parallel to the gradient vector of the constraint function g(x) = a1x1 + a2x2 at the optimizing point(s). Since the gradient vector of any function is perpendicular to its level sets, the constraint line is in fact tangent to the level set of the objective function at the minimizing point(s). Therefore, to find the optimizing point(s), we look for the level set of the objective function that happens to be tangent to the constraint.

Figure 10-14. The level sets of the quadratic function x1² + 4x2² are concentric ellipses; each of them has a constant value. When we impose the linear constraint x1 + x2 = 2.5, we get the optimizer (2, 0.5) at exactly the point where one of the level sets is tangent to the constraint. The value of the optimal level set is x1² + 4x2² = 5.
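The tangency condition in Figure 10-14 can be checked directly in a few lines (a sketch using NumPy; the numbers are the figure's instance, f(x) = x1² + 4x2² with constraint x1 + x2 = 2.5):

```python
# Check the Lagrange condition grad f = p * grad g at the optimizer of
#   min x1^2 + 4*x2^2  subject to  x1 + x2 = 2.5.
import numpy as np

def grad_f(x):
    # gradient of f(x1, x2) = x1^2 + 4*x2^2
    return np.array([2.0 * x[0], 8.0 * x[1]])

grad_g = np.array([1.0, 1.0])   # gradient of the constraint g(x1, x2) = x1 + x2
x_star = np.array([2.0, 0.5])   # the optimizer read off the figure

# The Lagrange multiplier p that makes the two gradients parallel:
p = grad_f(x_star)[0] / grad_g[0]

assert np.allclose(grad_f(x_star), p * grad_g)           # gradients align
assert np.isclose(x_star[0]**2 + 4 * x_star[1]**2, 5.0)  # winning level set value
print(p)    # the multiplier (shadow price), 4.0
```

At (2, 0.5) the gradient of the objective is (4, 4), a multiple of the constraint gradient (1, 1), so the level set is tangent to the line there.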

Another example that helps us visualize the Lagrangian and the upcoming min-max theorem is a trivial one-dimensional problem:

$$\min_{x = 1} x^2$$

The Lagrangian is ℒ(x; p) = x² − p(1 − x). We use this toy example, whose optimizer is obviously x = 1 with minimal value 1, so that we can visualize the Lagrangian. Recall that the Lagrange formulation makes the dimension jump up; in this case we have one constraint, so the dimension increases from one to two, and in our limited three-dimensional world we can only visualize functions of two variables (x and p). Figure 10-15 shows the landscape of our trivial Lagrangian function, which is representative of Lagrangian formulations for quadratic optimization problems with linear constraints. The main thing to pay attention to in Figure 10-15 is that the optimizers of these kinds of problems (x*; p*) happen at saddle points of the Lagrangian. These are points where the second derivative is positive in one variable and negative in the other, so the landscape of the Lagrangian function is convex in one variable (x) and concave in the other (p).

Figure 10-15. The optimizer of the constrained problem happens at the saddle point of the Lagrangian (note that the minimum of the Lagrangian itself is −∞, but that is not what we care about, since we care about the optimizer of the quadratic function with the linear constraint)

One way to locate the saddle points of the Lagrangian (which give us the optimizers of the corresponding constrained problems) is to solve ∇ℒ(x; p) = 0 for x and p, but that is the brute-force way that works for simple problems (like the trivial one at hand) or for small problems. Another way to find these saddle points is to minimize in x then maximize in p (Figure 10-16). Yet another way is to maximize in p then minimize in x. The min-max theorem says that these two paths give the same answer.
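Both paths can be checked by brute force on a grid for the toy Lagrangian ℒ(x; p) = x² − p(1 − x), whose saddle point is (x*, p*) = (1, −2) with value 1. The grid ranges below are chosen arbitrarily; note that for x ≠ 1 the true maximum over p is unbounded, so the finite p-grid merely truncates it without affecting the minimizing x.

```python
# Brute-force check of the min-max theorem for the toy Lagrangian
#   L(x; p) = x^2 - p * (1 - x),
# whose saddle point is (x*, p*) = (1, -2) with L(x*, p*) = 1.
import numpy as np

xs = np.linspace(-3.0, 3.0, 601)       # grid for x (includes x = 1)
ps = np.linspace(-50.0, 50.0, 1001)    # grid for p (includes p = -2)
X, P = np.meshgrid(xs, ps, indexing="ij")
L = X**2 - P * (1.0 - X)               # L on the whole (x, p) grid

min_then_max = L.max(axis=1).min()     # min over x of (max over p)
max_then_min = L.min(axis=0).max()     # max over p of (min over x)

print(min_then_max, max_then_min)      # both are (approximately) 1
assert abs(min_then_max - 1.0) < 1e-3
assert abs(max_then_min - 1.0) < 1e-3
```

Both orders of optimization land on the same saddle value 1, as the min-max theorem promises.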

Figure 10-16. Minimizing in x then maximizing in p gives max_p ℒ(x*(p); p) = ℒ(x*; p*) = the saddle point. This gives the same answer as maximizing in p then minimizing in x.

Therefore, at the saddle point (x*, p*), we have ∇ℒ(x*; p*) = 0 (which is the same as ∂ℒ(x; p)/∂x = 0 and ∂ℒ(x; p)/∂p = 0), and

$$\min_x \left( \max_p \mathcal{L}(x; p) \right) = \max_p \left( \min_x \mathcal{L}(x; p) \right)$$

We have gone full circle to demonstrate yet again that accompanying a constrained minimization problem in x, we have another constrained maximization problem in the Lagrange multipliers p. The interplay between Lagrange multipliers, duality, and constrained optimization is on full display.

Now that we have gone through the important ideas, let’s go back and put them in the context of the higher-dimensional quadratic problem with linear constraints that we started this subsection with:

$$\min_{Ax = b} \frac{1}{2} x^t S x$$

where S is a symmetric and positive definite matrix. The Lagrange formulation with relaxed constraints is:

$$\min_x \mathcal{L}(x; p) = \min_x \frac{1}{2} x^t S x + p^t \left( b - Ax \right)$$

Solving this unconstrained problem, whether by setting ∇ℒ(x; p) = 0, by minimizing over x then maximizing over p, or by maximizing over p then minimizing over x, we get the same solution (x*; p*), which happens at the saddle point of our high-dimensional Lagrangian and gives the optimal value of the objective function (the advantage of this problem's simple structure is that we can solve it by hand):

$$\text{minimum cost } F = \frac{1}{2} b^t \left( A S^{-1} A^t \right)^{-1} b$$

Moreover, the optimal shadow prices are p* = dF/db = (AS⁻¹Aᵗ)⁻¹b.
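These closed-form expressions are easy to verify numerically. The sketch below instantiates the problem from Figure 10-14 in this form (S = diag(2, 8) and A = (1 1), so that ½xᵗSx = x1² + 4x2², with b = 2.5), and uses x* = S⁻¹Aᵗp*, which follows from setting ∇ₓℒ = Sx − Aᵗp = 0:

```python
# Verify the closed-form solution of  min (1/2) x^t S x  subject to  A x = b:
#   p* = (A S^{-1} A^t)^{-1} b,  x* = S^{-1} A^t p*,
#   F  = (1/2) b^t (A S^{-1} A^t)^{-1} b,
# on the instance from Figure 10-14: S = diag(2, 8), A = (1 1), b = 2.5.
import numpy as np

S = np.diag([2.0, 8.0])
A = np.array([[1.0, 1.0]])
b = np.array([2.5])

S_inv = np.linalg.inv(S)
M = A @ S_inv @ A.T                  # the matrix A S^{-1} A^t (here 1x1)
p_star = np.linalg.solve(M, b)       # optimal shadow price(s)
x_star = S_inv @ A.T @ p_star        # optimizer, from S x = A^t p
F = 0.5 * b @ np.linalg.solve(M, b)  # optimal cost

print(x_star, p_star, F)             # [2. 0.5], [4.], 5.0
assert np.allclose(A @ x_star, b)                 # the constraint holds
assert np.isclose(F, 0.5 * x_star @ S @ x_star)   # F matches plugging x* back in
```

The formulas reproduce the optimizer (2, 0.5), the minimum cost F = 5, and the shadow price p* = 4 found graphically in Figure 10-14.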

The last thing we need to learn here is the characterization of saddle points in higher dimensions. For the one-dimensional constrained problem, the hallmark was having the second derivative of the Lagrangian (which was a function of x and p) negative in one variable and positive in the other. The high-dimensional analog is this: the Hessian matrix (the matrix of second derivatives) has negative eigenvalues with respect to one set of variables and positive eigenvalues with respect to the other, so the function is concave in one set of variables and convex in the other. Our discussion applies to optimizing any higher-dimensional objective function that is convex in one set of variables and concave in the other. This is the hallmark of a landscape with saddle points. For the Lagrangian function, the saddle point is exactly where the constrained problem attains its minimum.

Does this apply to linear optimization problems with linear constraints, which are everywhere in operations research? Yes, as long as we have the correct signs for all the coefficients in the problem, such as the max-flow min-cut and the two-person zero-sum game examples we saw in the previous subsections.

Sensitivity

Here we care about the sensitivity of the optimization problem and its solution with respect to changes in its input data. That is, what happens to the optimal solution x* and the optimal cost c·x* if we slightly change c, A, or b? Can we obtain the new optimal solution from the old one? Under what conditions can we do that? These are some important cases that sensitivity analysis addresses:

  • We have already interpreted the optimal p in the dual problem as the vector of marginal prices. This is related to sensitivity analysis: the rate of change of the optimal cost with respect to the constraint value.

  • If we add a new decision variable, we check its reduced cost, and if it is negative, we add a new column to the tableau and proceed from there.

  • If an entry of b or c is changed by δ , we obtain an interval of values of δ for which the same basis remains optimal.

  • If an entry of A is changed by δ , a similar analysis is possible. However, this case is somewhat complicated if the change affects an entry of a basic column.

In general, if we have a function and we want its sensitivity to variations with respect to one of its inputs, then that is similar to asking about its first derivative with respect to that input (at a certain state), or a discrete first derivative (finite difference) at that state. What makes sensitivity questions more interesting here is the fact that we are dealing with constrained problems and checking the effect of small variations in all kinds of inputs to the problem.
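As a concrete instance of this "sensitivity = derivative" view, we can verify on the quadratic problem of the previous subsection that the shadow price p* equals dF/db, using a centered finite difference (the numbers S = diag(2, 8), A = (1 1), b = 2.5 are carried over from Figure 10-14):

```python
# Sensitivity check: the optimal dual variable p* equals dF/db, the derivative
# of the optimal cost with respect to the constraint value b. We verify it with
# a centered finite difference on  F(b) = (1/2) b^t (A S^{-1} A^t)^{-1} b.
import numpy as np

S_inv = np.linalg.inv(np.diag([2.0, 8.0]))
A = np.array([[1.0, 1.0]])
M_inv = np.linalg.inv(A @ S_inv @ A.T)

def F(b):
    b = np.atleast_1d(b)
    return 0.5 * b @ M_inv @ b

b0, h = 2.5, 1e-5
p_star = (M_inv @ np.atleast_1d(b0))[0]     # shadow price; equals 4 here
dF_db = (F(b0 + h) - F(b0 - h)) / (2 * h)   # discrete first derivative of F

print(p_star, dF_db)
assert abs(dF_db - p_star) < 1e-6
```

Because F is quadratic in b, the centered difference matches the shadow price essentially to machine precision; for a general constrained problem the agreement would only hold up to the finite-difference error.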

Game Theory and Multiagents

Game theory is so important for economics, politics, military operations, multiagent AI, and basically for modeling any environment where there are adversaries or competitors and we have to make decisions or strategize in these conditions. Our optimal strategies are heavily influenced by our adversaries’ strategies, whether we know them or are only speculating about them.

The easiest and most well-understood game theory setting is that of two-person zero-sum games, which we saw when discussing duality. Here, there are two competing entities, where the loss of one entity is the win of the other, for example, two political campaigns or two competing firms. It has been a challenge to extend the theory to more complex real-life situations with many competitors with varying advantages and disadvantages over each other, varying degrees of cooperation, along with many interrelated strategies. There is still a gap between the situations that the theory can accurately portray and analyze and real-life situations. Progress is happening, and many researchers are on this case, due to the incredible benefits such a complete theory would bring to the world. Imagine being able to view the whole network of adversaries from above, with their movements, connections, possible strategies, and their consequences.

For multiagent environments, game theory models the rational behavior or decision-making process for each involved agent (player, firm, country, military, political campaign, etc.). In this sense, game theory for multiagents is similar to decision theory for a single agent.

The most important concept for noncooperative game theory (where the agents make their decisions independently) is that of the Nash equilibrium: a strategy outline for the game where each agent has no incentive to deviate from the outline’s prescribed strategy. That is, the agent will be worse off if they deviate from the strategy, of course, assuming everyone is acting rationally.

As we saw in the section on duality, for two-person zero-sum games, we can model them as a min-max problem, and use the min-max theorem. We can also model them as a linear optimization problem, where one player is solving the primal problem, and the other is solving the dual problem. This means that we can either set up the optimization problem for the first player or for the second player. Both problems will end up with the same solution. Here we are given the payoff chart for all strategies in the game for both players, and the objective is to find the strategy combination that maximizes the pay (or minimizes the loss) for each player. Intuitively, we can see why duality is built into this problem. The two players are pushing against each other, and the optimal strategy for each player solves both the primal and the dual problems.

We can also use graphs and results from graph theory to analyze two-person games. This is similar to how we can formulate the max flow through a network as a linear optimization problem. Ultimately, many things in math connect neatly together, and one of the most satisfying feelings is when we understand these connections.

For multiagents, certain techniques are available for decision making, including voting procedures, auctions for allocating scarce resources, bargaining for reaching agreements, and the contract net protocol for task sharing. In terms of mathematical modeling for multiagent games, we will discuss in Chapter 13 (AI and partial differential equations) the Hamilton-Jacobi-Bellman partial differential equation. Here, to find the optimal strategy for each player, we have to solve a high-dimensional Hamilton-Jacobi-Bellman type partial differential equation for the game’s value function. Before deep learning, these types of high-dimensional partial differential equations were intractable, and one had to make many approximations, or not consider all the participating entities. Recently (2018), a deep learning technique has been applied to solve these high-dimensional partial differential equations, once they are reformulated as backward stochastic differential equations with terminal conditions (don’t worry if you do not know what this means; it is not important for this chapter).

We encountered another two-person adversarial game theoretic setting earlier in this book when discussing generative adversarial networks in Chapter 8 on probabilistic generative models.

Queuing

Queues are everywhere: computing jobs for machines, service queues at a shipyard, queues at the emergency room, airport check-in queues, and queues at the local Starbucks. Well-designed queue systems save different facilities and our entire economy an invaluable amount of time, energy, and money. They enhance our overall well-being.

Mathematical modeling of queues has the objective of determining the appropriate level of service to minimize waiting times. The model might include a priority discipline, which means that there are priority groups, and the order in which the members get serviced depends on their priority groups. It might also include different types of services that happen sequentially or in parallel, or some in sequence and others in parallel (for example, in a ship maintenance facility). Some models include multiple service facilities—a queuing network.

There are thousands of papers on queuing theory. It is important to recognize the basic ingredients of a queuing mathematical model:

  • The members of the queue (customers, ships, jobs, patients) arrive at certain inter-arrival times. If the arrival process is random, then the math model must decide on a probability distribution that this inter-arrival time adheres to, either from the data or from mathematical distributions known to model such times. Some models assume constant arrival times. Others assume the exponential distribution (a Markovian process) as it facilitates the mathematical analysis and mimics the real-life process better. Others assume the Erlang distribution, which allows different exponential distributions for different time intervals. Others assume even more general distributions. The more general the distribution, the less easy the mathematical analysis. Numerical simulations are our best friend forever.

  • The number of servers (parallel and sequential) available: an integer.

  • The service times also follow a certain probability distribution that we must decide on. Common distributions are similar to those used for inter-arrival times.

In addition, the mathematical model must also keep track of the following:

  • The initial number of members in the full queuing system (those waiting and those currently being serviced)

  • The probability of having n members in the full queuing system at a given later time

Finally, the model wants to compute the steady state of the queuing system:

  • The probability of having n members in the full queuing system

  • The expected number of new members arriving per unit time

  • The expected number of members completing their service per unit time

  • The expected waiting time for each member in the system

Members enter the queue at a certain mean rate, wait to get serviced, get serviced at a certain mean rate, then leave the facility. The mathematical model must quantify these and balance them.
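
To make this balance concrete, here is a minimal sketch of the simplest such model, the M/M/1 queue (Poisson arrivals, exponential service times, a single server), whose steady state has closed-form expressions. The arrival and service rates below are hypothetical, chosen only for illustration.

```python
# Steady state of an M/M/1 queue: Poisson arrivals at mean rate lam,
# exponential service times at mean rate mu, one server. The system is
# stable only when lam < mu, i.e., service keeps up with arrivals.
def mm1_steady_state(lam, mu):
    assert lam < mu, "the queue grows without bound when arrivals outpace service"
    rho = lam / mu                       # server utilization
    L = rho / (1 - rho)                  # expected number of members in the system
    W = 1.0 / (mu - lam)                 # expected time a member spends in the system
    p = lambda n: (1 - rho) * rho ** n   # Prob(n members in the full system)
    return rho, L, W, p

# Hypothetical rates: 4 arrivals and 5 service completions per hour.
rho, L, W, p = mm1_steady_state(lam=4.0, mu=5.0)
print(round(L, 6))      # 4.0 members in the system on average
print(round(W, 6))      # 1.0 hour in the system on average
print(round(p(0), 6))   # 0.2: the server is idle 20% of the time
# Little's law ties these quantities together: L = lam * W
```

Note how close the system runs to capacity here (80% utilization) and yet how quickly the expected queue length blows up as lam approaches mu; this sensitivity is why well-designed service levels matter.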

Inventory

With the current shortages in the supply chain, the symptoms are empty shelves at grocery stores, shortages of car repair parts, new cars, materials for home renovation, and many others. There is obviously a gap between supply and demand. The times between replenishing the supplies at stores have increased in a way that is causing backlogs, low productivity, and an overall slowed economy. Mathematical models for inventory management quantify the supply (stochastically or deterministically) and the demand, and devise an optimal inventory policy for timing replenishments and deciding on the quantity required at each one. Ideally, the model must have access to an information processing system that gathers data on current inventory levels and then signals when and by how much to replenish them.
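
As a minimal illustration of such a policy, the classic economic order quantity (not named in the text, but the simplest deterministic special case) answers both questions at once: how much to order and how often. The cost figures below are hypothetical.

```python
import math

# Classic economic order quantity (EOQ): a minimal deterministic special
# case of the inventory policies described above. It balances the fixed
# cost K of placing an order against the cost h of holding one unit per
# unit time, for a known, constant demand rate d.
def eoq(d, K, h):
    Q = math.sqrt(2 * d * K / h)   # optimal quantity to order each time
    cycle = Q / d                  # optimal time between replenishments
    return Q, cycle

# Hypothetical figures: 1,000 units/year demand, $50 per order,
# $4 per unit per year to hold inventory.
Q, cycle = eoq(d=1000.0, K=50.0, h=4.0)
print(round(Q, 2))       # 158.11 units per order
print(round(cycle, 4))   # 0.1581 years between orders
```

Stochastic demand, lead times, and priority rules complicate the picture considerably, but the same trade-off between ordering cost and holding cost sits at the core of the richer models.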

Machine Learning for Operations Research

For starters, what is extremely exciting nowadays in operations research, as opposed to 10 years ago, is the ability to solve massive operations research problems, sometimes involving tens of millions of constraints and decision variables. We have to thank the computational power explosion and the continuing improvement of computer implementations of operations research algorithms for this.

Moreover, machine learning can help predict the values of many parameters that enter into operations research models using volumes of available data. If these parameters were hard to measure, modelers had to either remove them from the model or make assumptions about their values. This doesn’t have to be the case anymore because of more accurate machine learning models that are able to take thousands of variables into account.

Finally, machine learning can help speed up searching through combinatorially large search spaces by learning which parts of the space to focus on or which subproblems to prioritize. This is exactly what the article “Learning to Delegate for Large-scale Vehicle Routing” (Li et al. 2021) does, making vehicle routing 10 to 100 times faster than the state-of-the-art routing algorithms.

Similar research at the intersection of machine learning and operations research is booming with great progress and scalable solutions. The list of abstracts from the conference Operations Research Meets Machine Learning offers a great variety of relevant projects, such as real-time synthesis and processing of data from sensors in waste bins (tracking the volume) for more efficient waste collection operations (since this relies on real-time data, the team relies on dynamic routing). Another great example is a bike sharing system, where the objective is to predict the number of bikes needed at each location and allocate teams to distribute the required number of bikes efficiently. Here is the abstract:

Operators in a bike sharing system control room are constantly re-allocating bikes where they are most likely to be needed, this requires an insight on the optimum number of bikes needed in each station, and the most efficient way to distribute teams to move the bikes around. Forecasting engines and Decision Optimization is used to calculate the optimal number of bikes for each station at any given time, and plan efficient routes to help the redistribution of bikes accordingly. A solution delivered by DecisionBrain and IBM for the bike sharing system in London is the first application of its kind that uses both optimization and machine learning to solve cycle hire inventory, distribution and maintenance problems, and could easily be re-deployed for other cycle sharing systems around the world.

In fact, DecisionBrain’s projects are worth browsing and thinking through.

Currently, my team and I are working on a problem with the Department of Public Transportation in my city. This is a perfect setting where machine learning meets operations research. Using historical ridership data, in particular daily boardings and alightings at each bus stop in the city, along with population density, demographics, vulnerability, city zoning data, car ownership, university enrollment, and parking data, we use neural networks to predict supply and demand patterns at each stop. Then we use this data and optimal network design from operations research to redesign the bus routes so that the bus stops, in particular those in the most socially vulnerable areas in the city, are adequately and efficiently serviced.

Hamilton-Jacobi-Bellman Equation

The fields of operations research, game theory, and partial differential equations intersect through dynamic programming and the Hamilton-Jacobi-Bellman partial differential equation. Richard Bellman (mathematician, 1920–1984) first coined the term curse of dimensionality in the context of dynamic programming. Now the curse of dimensionality has rendered real-life applications of this very useful equation limited and unable to incorporate all the players of a game (or competing markets, countries, militaries) and solve for their optimal strategies, or the thousands of variables that can be involved in operations research problems, such as for optimal resource allocation problems. The tides have turned with deep learning. The paper “Solving High-Dimensional Partial Differential Equations Using Deep Learning” (Han et al. 2018) presents a method to solve this equation and others for very high dimensions. We will discuss the idea of how the authors do it in Chapter 13 on AI and partial differential equations.

Operations Research for AI

Operations research is the science of decision making based on optimal solutions. Humans are always trying to make decisions based on the available circumstances. Artificial intelligence aims to replicate all aspects of human intelligence, including decision making. In this sense, the decision-making methods that operations research employs automatically fit into AI. The ideas in dynamic programming, Markov chains, optimal control and the Hamilton-Jacobi-Bellman equation, advances in game theory and multiagent games, network optimization, and others have evolved along with AI throughout the decades. In fact, many startups market themselves as AI companies, while in reality they are doing good old (and awesome) operations research.

Summary and Looking Ahead

Operations research is the field of making the best decisions given the current knowledge and circumstances. It always comes down to finding clever ways to search for optimizers in very high-dimensional spaces.

One theme throughout this book is the curse of dimensionality, and all the effort researchers put in to find ways around it. In no field does this curse show up as broadly as in operations research. Here, the search spaces grow combinatorially with the number of players in a particular problem: number of cities on a route, number of competing entities, number of people, number of commodities, etc. There are very powerful exact methods and heuristic methods, but there is much room for improvement in terms of speed and scale.

Machine learning, in particular deep learning, provides a way to learn from previously solved problems, labeled data, or simulated data. This speeds up optimization searches if we identify the bottlenecks and are able to articulate the source of the bottleneck as a machine learning problem. For example, a bottleneck can be: we have too many subproblems to solve but we do not know which ones to prioritize to quickly get us closer to the optimum. To use machine learning to address this, we need a data set of already solved problems and subproblems, and have a machine learning model learn which subproblems should be prioritized. Once the model learns this, we can use it to speed up new problems.

Other uses of machine learning in operations research include business as usual type of machine learning: predict demand from available data, either real-time or historical data, then use operations research to optimize resource allocation. Here machine learning helps make better predictions for demand and hence increases efficiency and reduces waste.
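
A toy sketch of this predict-then-optimize pattern might look as follows. The naive forecast, the proportional allocation rule, and all numbers are illustrative stand-ins for a trained model and an exact solver.

```python
# Predict-then-optimize, in miniature: a naive forecast of demand per
# location (standing in for a trained machine learning model), followed
# by a proportional allocation of scarce supply (standing in for an
# exact operations research solver).
def forecast(history):
    return sum(history) / len(history)   # naive forecast: recent average

def allocate(demands, supply):
    total = sum(demands)                 # split the scarce supply in
    return [supply * d / total for d in demands]   # proportion to demand

# Hypothetical demand history per location
history = {"north": [120, 130, 125], "south": [80, 90, 85], "east": [50, 55, 45]}
demands = [forecast(h) for h in history.values()]
allocation = allocate(demands, supply=200.0)   # only 200 units to distribute
for name, amount in zip(history, allocation):
    print(name, round(amount, 1))
```

The better the forecast, the less supply is wasted on locations that will not need it, which is precisely the efficiency gain described above.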

In this chapter, we gave a broad overview of the field of operations research and its most important types of problems. We especially emphasized linear optimization, networks, and duality. Powerful software packages are available for many useful problems. We hope that these packages keep integrating the latest progress in the field.

Two topics that are usually not taught in introductory operations research courses are the Hamilton-Jacobi-Bellman partial differential equation for optimal control and strategies of multiplayer games, and optimizing functionals using calculus of variations. These are usually considered advanced topics in partial differential equations. We discussed them here because they both tie naturally into optimization and operations research. Moreover, viewing them in this context demystifies their corresponding fields.

When doing operations research and optimizing for cost reduction, revenue increase, time efficiency, etc., it is important that our optimization models do not ignore the human factor. If the output of a scheduling model disrupts low-wage workers’ lives through erratic schedules in order to maintain a certain on-time company performance, then that is not a good model. The quality of the lives and livelihoods of the workers that a company relies on needs to be quantified, then factored into the model. Yes, the quality of life needs to be quantified, since everything else is being factored in, and we cannot leave this out. Companies with hundreds of thousands of low-wage workers have a responsibility to ensure that their operations research algorithms do not end up trapping their workers in poverty.

We leave this chapter with an excerpt from “Uncertainties in Operations Research,” a 1960 paper by Charles Hitch. Reading this (the brackets are my edits), one cannot help but ponder how far the operations research field has come since 1960:

No other characteristic of decision making is as pervasive as uncertainty. When, as operations researchers, to simplify a first cut at an analysis, we assume that the situation can be described by certainty equivalents, we may be doing violence to the facts and indeed the violence may be so grievous as to falsify the problem and give us a nonsense solution. How, for example, can we help the military make development decisions—​decisions about which aircraft or missiles to develop when the essence of the problem is that no one can predict with accuracy how long it will take to develop any of the competing equipments, or to get them operational, how much they will cost, what their performance will be, or what the world will be like at whatever uncertain future date turns out to be relevant (if indeed, the world still exists then)? When I say “cannot predict with accuracy” I am not exaggerating. We find that typically, for example, the production costs of new equipment are underestimated in the early stages of development by factors of two to twenty (not 2 to 20 per cent, but factors of two to twenty). Why they are always underestimated, never overestimated, I leave to your fertile imaginations. […​] Another thing that [an operations researcher] can frequently do, especially in problems involving research and development, is to ascertain the critical uncertainties and recommend strategies to reduce them—​to buy information. If you do not know which of two dissimilar techniques for missile guidance will turn out to be better, your best recommendation is very likely to be: keep them both in development a while longer and choose between them when more and better information is available. Never mind the people who call you indecisive. You can prove that this kind of indecisiveness can save both money and time. Of course you can’t afford to try everything. There isn’t enough budget. There aren’t enough resources. 
You remember when we used to say “If you gave the military services everything they asked for they’d try to fortify the moon!” (We’ll have to change that figure of speech.) Actually, it is because of limitations on resources that operations research and operations researchers are important. There’d be no problems for us if there were no constraints on resources. It is our job and opportunity to find, or invent, within the constraints, some better pattern of adjusting to an uncertain world than our betters would find if we weren’t here; or some better way, taking costs and pay-offs into account, to buy information to reduce the uncertainty.

Chapter 11. Probability

Can we still expect anything, if chance is all there is?

H.

Probability theory is one of the most beautiful subjects in mathematics, moving us back and forth between the stochastic and deterministic realms in what should be magic but turns out to be mathematics and its wonders. Probability provides a systematic way to quantify randomness, control uncertainty, and extend logic and reasoning to situations that are of paramount importance in AI: when information and knowledge include uncertainties, and/or when the agent navigates unpredictable or partially observed environments. In such settings, an agent calculates probabilities about the unobserved aspects of a certain environment, then makes decisions based on these probabilities.

Humans are uncomfortable with uncertainty, but are comfortable with approximations and expectations. They do not wake up knowing exactly how every moment of their day will play out, and they make decisions along the way. A probabilistic intelligent machine exists in a world of probabilities, as opposed to deterministic and fully predetermined truths and falsehoods.

Throughout this book, we have used probability theory terms and techniques as they came along and only when we needed them. Through this process, we now realize that we need to be well versed in joint probability distributions (for example, of features of data), conditioning, independence, Bayes’ Theorem, and Markov processes. We also realize that we can get back to the deterministic world via computing averages and expectations.

One feature of the chapters in this book is that each needs its own book to have an in-depth and comprehensive treatment. This couldn’t be more true than for a chapter on probability theory, where there are thousands of topics to include. I had to make choices, so I based the topics that I opted to cover in this chapter on three criteria:

  1. What we already used in this book that has to do with probability

  2. What confused me the most in probability as a student (like why do we need measure theory when computing probabilities?)

  3. What else we need to know from probability theory for AI applications

Where Did Probability Appear in This Book?

Let’s make a fast list of the places where we used probability or resorted to stochastic methods in this book. We consider this list as the essential probability for AI. Note that prior probabilities are unconditional, because they are prior to observing the data, or the evidence; and posterior probabilities are conditional, because their value is conditioned on observing the relevant data. It makes sense that our degree of belief about something changes after receiving new and related evidence. The joint probability distribution of all the involved variables is what we are usually after, but it is generally too large, and the information needed to fully construct it is not always available.

Here is the list:

  • When minimizing the loss function of deterministic machine learning models (where the training function takes nonrandom inputs and produces nonrandom outputs), such as regression, support vector machines, neural networks, etc., we use stochastic gradient descent and its variants, randomly choosing a subset of training data instances at each gradient descent step, as opposed to using the whole training data set, to speed up computations.

  • In Chapter 9 on graph models, we utilized random walks on graphs on many occasions, implementing these walks via the weighted adjacency matrix of the graph.

  • Specific probability distributions appeared in Chapter 10 on operations research, such as probability distributions for inter-arrival and service times for customers in a queue.

  • Dynamic decision making and Markov processes also appeared in Chapter 10 on operations research and are fundamental for reinforcement learning in AI. They will appear again in this chapter, then once more in Chapter 13 in the context of the Hamilton-Jacobi-Bellman equation.

  • For two-person zero-sum games in Chapter 10, each player had a probability of making a certain move, and we used that to compute the player’s optimal strategy and expected payoff.

  • Monte Carlo simulation methods are computational algorithms that rely on repeated random sampling to solve deterministic problems numerically. We illustrate an example of these in Chapter 13 on AI and PDEs.

  • We mentioned the universality theorem for neural networks many times, and we will prove it in this chapter. This proof is the only theoretical part in this book, and it will give us a nice flavor of measure theory and functional analysis.

  • Probabilistic machine learning models learn the joint probability distribution of the data features Prob(x₁, x₂, …, xₙ, y_target), instead of learning deterministic functions of these features. This joint probability distribution encodes the likelihood of these features occurring at the same time. Given the input data features (x₁, x₂, …, xₙ), the model outputs the conditional probability of the target variable given the data features, Prob(y_predict | x₁, x₂, …, xₙ), as opposed to outputting y_predict as a deterministic function of the features: y_predict = f(x₁, x₂, …, xₙ).

  • Random variables and the two most important quantities associated with them, namely the expectation (expected average value of the random variable) and variance (a measure of the spread around the average): we have been using these without formally defining them. We will define them in this chapter.

  • The product rule or the chain rule for probability, namely:

    Prob(x₁, x₂) = Prob(x₁ | x₂) Prob(x₂) = Prob(x₂ | x₁) Prob(x₁)

    or for more than two variables, say three without loss of generality:

    Prob(x₁, x₂, x₃) = Prob(x₁ | x₂, x₃) Prob(x₂, x₃) = Prob(x₁ | x₂, x₃) Prob(x₂ | x₃) Prob(x₃)
  • 的概念独立性和有条件的独立性是根本。如果一个事件的发生不影响另一个事件发生的概率,则两个事件是独立的。所考虑的功能的独立性极大地简化了。它帮助我们解开许多变量的复杂联合分布,将它们简化为更少变量的简单乘积,并使许多以前难以处理的计算变得易于处理。这极大地简化了对世界的概率解释。注意仅两个事件的独立性之间的区别( r X 1 , X 2 = r X 1 r X 2 )和许多事件的独立性,这是一个强有力的假设,其中每个事件都独立于其他事件的任何交集。

  • The concepts of independence and conditional independence are fundamental. Two events are independent if the occurrence of one does not affect the probability of occurrence of the other. Independence of the considered features is tremendously simplifying. It helps us disentangle complex joint distributions of many variables, reducing them to simple products of fewer variables, and rendering many previously intractable computations tractable. This greatly simplifies the probabilistic interpretations of the world. Pay attention to the difference between independence of only two events (Prob(x₁, x₂) = Prob(x₁) Prob(x₂)) and independence of many events, which is a strong assumption where every event is independent of any intersection of the other events.

  • For the probabilistic generative models of Chapter 8, we assumed a prior probability distribution, passed it through a neural network, and adjusted its parameters.

  • Bayes’ Theorem is essential when discussing joint and conditional probabilities. It helps us quantify an agent’s beliefs relative to evidence. We use it in many contexts, which immediately illustrate its usefulness, such as:

    P r o b ( d i s e a s e | s y m p t o m s ) = Prob(symptoms|disease)Prob(disease) Prob(symptoms)

    or

    Prob(target | data) = Prob(data | target) Prob(target) / Prob(data)

    or

    Prob(target | evidence) = Prob(evidence | target) Prob(target) / Prob(evidence)

    or

    Prob(cause | effect) = Prob(effect | cause) Prob(cause) / Prob(effect)

    Note that in the last formula, Prob(cause | effect) quantifies the diagnostic direction, while Prob(effect | cause) quantifies the causal direction.

  • Bayesian networks are data structures that represent dependencies among variables. Here, we summarize the variable relationships in a directed graph and use that to determine which conditional probability tables we need to keep track of and update in the light of new evidence: we keep track of the probability of a child node conditional on observing its parents. The parents of a node are any variables that directly influence this node. In this sense, the Bayesian network is a representation of the joint probability distribution, with the simplification that we know how the involved variables relate to each other (which variables are the parents of which variables):

    Prob(x1, x2, …, xn) = Π_{i=1}^{n} Prob(xi | parents(Xi))
  • In machine learning we can draw a line between regression models and classification models. In Chapter 8 on probabilistic generative models, we encountered a popular probabilistic model for classification: Naive Bayes. In cause-and-effect language, the naive assumption is that some observed multiple effects are independent given a cause, so that we can write:

    Prob(cause | effect1, effect2, effect3) ∝ Prob(cause) Prob(effect1 | cause) Prob(effect2 | cause) Prob(effect3 | cause)

    When that formula is used for classification given data features, the cause is the class. Moreover, we can draw a Bayesian network representing this setting. The cause variable is the parent node, and all the effects are child nodes stemming from the single parent node (Figure 11-1).

Figure 11-1. Bayesian network representing three effects with a common cause
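
The Naive Bayes computation can be sketched numerically. In this minimal example, all the priors and conditional probabilities are made-up illustrative values; only the structure (one cause, three conditionally independent effects) comes from the discussion above:

```python
# One cause (the class) with three conditionally independent effects
# (the features). All probability values here are made up for illustration.
priors = {"flu": 0.1, "no_flu": 0.9}                 # Prob(cause)
likelihoods = {                                      # Prob(effect | cause)
    "flu":    {"fever": 0.80, "cough": 0.70, "fatigue": 0.60},
    "no_flu": {"fever": 0.05, "cough": 0.20, "fatigue": 0.30},
}

def posterior(observed_effects):
    # Unnormalized score: Prob(cause) * product of Prob(effect | cause)
    scores = {}
    for cause, prior in priors.items():
        score = prior
        for effect in observed_effects:
            score *= likelihoods[cause][effect]
        scores[cause] = score
    total = sum(scores.values())   # normalizing constant, Prob(effects)
    return {cause: score / total for cause, score in scores.items()}

print(posterior(["fever", "cough", "fatigue"]))
```

Normalizing by the sum of the scores is what turns the proportionality into an actual posterior probability over the classes.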

What More Do We Need to Know That Is Essential for AI?

We need a few extra topics that have either not gotten any attention in this book or were only mentioned casually and pushed to this chapter for more details. These include:

  • Judea Pearl’s causal modeling and the do calculus

  • Some paradoxes

  • Large random matrices and high-dimensional probability

  • Stochastic processes such as random walks, Brownian motion, and more

  • Markov decision processes and reinforcement learning

  • Theory of probability and its use in AI

The rest of this chapter focuses on these topics.

Causal Modeling and the Do Calculus

In principle, the arrows between related variables in a Bayesian network can point in any direction. They all eventually lead to the same joint probability distribution, albeit some in more complicated ways than others.

In contrast, causal networks are those special Bayesian networks where the directed edges of the graph cannot point in any direction other than the causal direction. For these, we have to be more mindful when constructing the connections and their directions. Figure 11-2 shows an example of a causal Bayesian network.

Figure 11-2. Causal Bayesian network

Note that both Bayesian networks and causal networks make strong assumptions on which variables listen to which variables.

Agents endowed with causal reasoning are, in human terms, higher functioning than those merely observing patterns in the data, then making decisions based on the relevant patterns.

The following distinction is of paramount importance:

  • In Bayesian networks, we content ourselves with knowing only whether two variables are probabilistically dependent. Are fire and smoke probabilistically dependent?

  • In causal networks, we go further and ask about which variable responds to which variable: smoke to fire (so we draw an arrow from fire to smoke in the diagram), or fire to smoke (so we draw an arrow from smoke to fire in the diagram)?

What we need here is a mathematical framework for intervention to quantify the effect of fixing the value of one variable. This is called the do calculus (as opposed to the statistical observe and count calculus). Let’s present two fundamental formulas of the do calculus:

  • The adjustment formula

  • The backdoor criterion

According to Judea Pearl, the inventor of this wonderful way of causal reasoning (and whose The Book of Why (2020) inspires the discussion in this section and the next one), these allow the researcher to explore and plot all possible routes up Mount Intervention, no matter how twisty, and can save us the costs and difficulties of running randomized controlled trials, even when these are physically feasible and legally permissible.

An Alternative: The Do Calculus

Given a causal network, which we construct based on a combination of common sense and subject matter expertise, while at the same time throwing in extra unknown causes for each variable just to be sure that we are accounting for everything, the overarching formula is that of the joint probability distribution:

Prob(x1, x2, …, xn) = Π_{i=1}^{n} Prob(xi | parents(Xi))

Then we intervene, applying do(Xj = x*). This severs any edges pointing to Xj, and affects all the conditional probabilities of the descendants of Xj, leading to a new joint probability distribution that no longer includes a conditional probability for the intervened variable: we have set its value to Xj = x* with probability one, and any other value has probability zero. Figure 11-2 shows how, when we set the sprinkler on, all arrows leading to it in the original network get severed.

Thus we have:

Prob_intervened(x1, x2, …, xn) = Π_{i≠j} Prob(xi | parents(Xi)) if xj = x*, and 0 otherwise

The adjustment formula

What we truly care about is how setting Xj = x* affects the probability of every other variable in the network, and we want to compute these probabilities from the original, unintervened network. In mathematical terms, we want expressions without the do operator, since we can then just observe the data to get these values, as opposed to running new experiments.

To this end, we introduce the adjustment formula, or controlling for confounders (possible common causes). This is a weighted average of the influence of Xj and its parents on Xi. The weights are the priors on the parent values:

Prob(xi | do(Xj = x*)) = Prob_intervened(Xi = xi) = Σ_{parents(Xj)} Prob(xi | x*, parents(Xj)) Prob(parents(Xj))

Note that this formula achieves our goal of eliminating the do operator and gets us back to finding our conditional probabilities by observing the data, rather than running some costly intervention experiments, or randomized control trials.
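
The adjustment formula can be sketched numerically on a toy network where Z is a confounder pointing to both X and Y, and X points to Y. All probability values and variable names below are hypothetical:

```python
# Toy network Z -> X, Z -> Y, X -> Y, where Z confounds the effect of X on Y.
# All probability values here are made up for illustration.
prob_z = {0: 0.6, 1: 0.4}        # prior Prob(Z), the parent of X
prob_y1_given_xz = {             # Prob(Y = 1 | X = x, Z = z)
    (0, 0): 0.10, (0, 1): 0.40,
    (1, 0): 0.30, (1, 1): 0.70,
}

def prob_y1_do_x(x):
    # Adjustment formula: Prob(Y=1 | do(X=x)) = sum_z Prob(Y=1 | x, z) Prob(z)
    return sum(prob_y1_given_xz[(x, z)] * pz for z, pz in prob_z.items())

causal_effect = prob_y1_do_x(1) - prob_y1_do_x(0)
print(prob_y1_do_x(1), prob_y1_do_x(0), causal_effect)
```

Note that the computation uses only quantities we can estimate from observational data: the prior on the confounder and the conditional probabilities of the outcome; no interventional experiment appears anywhere.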

The backdoor criterion, or controlling for confounders

There is more to the causal diagrams story. We would like to know the effect of the intervention do(Xj = x*) on a certain downstream variable X_down in the diagram. To do so, we should be able to condition on the values of any other ancestors in the diagram whose paths also lead down to the downstream variable we care about. In causal modeling, we call this process blocking the back doors, or the backdoor criterion:

Prob(x_down | do(Xj = x*)) = Prob_intervened(X_down = x_down) = Σ_{ancestors(X_down)} Prob(x_down | x*, ancestors(X_down)) Prob(ancestors(X_down))

Controlling for confounders

The most common way for scientists and statisticians to predict the effects of an intervention so that they can make statements about causality is to control for possible common causes, or confounders. Figure 11-3 shows the variable Z as a confounder of the suspected causal relationship between X and Y.

Figure 11-3. Z is a confounder of the suspected causal relationship between X and Y

This is because, in general, confounding is a main source of confusion between mere observation and intervention. It is also the source of the famous statement correlation is not causation. This is where we see some bizarre and entertaining examples: high temperature is a confounder for ice cream sales and shark attacks (but why would anyone study any sort of relationship between ice cream and sharks to start with?). The backdoor criterion and the adjustment formula easily take care of confounder obstacles to stipulating about causality.

We use the adjustment formula to control for confounders if we are confident that we have data on a sufficient set of deconfounder variables to block all the backdoor paths between the intervention and the outcome. To do this, we estimate the causal effect stratum by stratum from the data, then we compute a weighted average of those strata, where each stratum is weighted according to its prevalence in the population.

Now, without the backdoor criterion, statisticians and scientists have no guarantee that any adjustment is legitimate. In other words, the backdoor criterion guarantees that the causal effect in each stratum of the deconfounder is in fact the observed trend in this stratum.

Are there more rules that eliminate the do operator?

Rules that are able to move us from an expression with the do operator (intervention) to an expression without the do operator (observation) are extremely desirable, since they eliminate the need to intervene. They allow us to estimate causal effects by mere data observation. The adjustment formula and the backdoor criterion did exactly that for us.

Are there more rules? The more ambitious question is: is there a way to decide ahead of time whether a certain causal model lends itself to do operator elimination, so that we would know whether the assumptions of the model are sufficient to uncover the causal effect from observational data without any intervention? Knowing this is huge! For example, if the assumptions of the model are not sufficient to eliminate the do operator, then no matter how clever we are, there is no escape from running interventional experiments. On the other hand, if we do not have to intervene and still estimate causal effects, the savings are spectacular. These alone are worth digging more into probabilistic causal modeling and the do calculus.

To get the gist of Judea Pearl’s do calculus, we always start with a causal diagram, and think of conditioning criteria leading to the deletion of edges pointing toward or out from the variable(s) of interest. Pearl’s three rules give us the conditions under which:

  1. We can insert or delete observations:

    Prob(y | do(x), z, w) = Prob(y | do(x), w)

  2. We can insert or delete interventions:

    Prob(y | do(x), do(z), w) = Prob(y | do(x), w)

  3. We can exchange interventions with observations:

    Prob(y | do(x), do(z), w) = Prob(y | do(x), z, w)

For more details on the do calculus, see “The Do-Calculus Revisited,” by Judea Pearl (Keynote Lecture, August 17, 2012).

Paradoxes and Diagram Interpretations

AI agents need to be able to handle paradoxes. We have all seen cartoons where a robot gets into a crazy loop or even physically self-dismantles with screws and springs flying all around when its logic encounters a paradox. We cannot let that happen. Furthermore, paradoxes often appear in very consequential settings, such as in the pharmaceutical and medical fields, so it is crucial that we scrutinize them under the lens of mathematics and carefully unravel their mysteries.

Let’s go over three famous paradoxes: Monty Hall, Berkson, and Simpson. We will view them in the light of diagrams and causal models: Monty Hall and Berkson paradoxes cause confusion due to colliders (two independent variables pointing to a third one), while Simpson paradoxes cause confusion due to confounders (one variable pointing to two others). An AI agent should be equipped with these diagrams as part of its data structure (or with the ability to construct them and adjust them) in order to reason properly.

Judea Pearl’s The Book of Why puts it perfectly:

Paradoxes reflect the tensions between causation and association. The tension starts because they stand on two different rungs of the Ladder of Causation [observation, intervention, counterfactuals] and is aggravated by the fact that human intuition operates under the logic of causation, while data conform to the logic of probabilities and proportions. Paradoxes arise when we misapply the rules we have learned in one realm to the other.

Monty Hall Problem

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, “Do you want to pick door #2?” Is it to your advantage to switch your choice of doors?

The answer is yes, switch doors, because without switching, your probability of getting the car is 1/3, and after switching, it jumps up to 2/3! The main thing to pay attention to here is that the host knows where the car is, and chooses to open a door that he knows does not have the car in it.

So why would the probability of winning double if we switch from our initial choice? Because the host offers new information that we would leverage only if we switch from our initial information-less choice:

Under the no-switch strategy
  • If we initially chose the winning door (probability 1/3), and we do not switch, then we win.

  • If we initially chose a losing door (probability 2/3), and we do not switch, then we lose.

This means that we would win only 1/3 of the time under the no-switch strategy.

Under the switch strategy
  • If we initially chose the winning door (probability 1/3), and we switch from it, then we would lose.

  • If we initially chose a losing door (probability 2/3), new information comes pointing to the other losing door, and we switch, then we would win, because the only door left would be the winning door.

This means that we would win 2/3 of the time under the switch strategy.
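
The switching argument is easy to verify with a small Monte Carlo simulation (a sketch, with the host always knowingly opening a losing door):

```python
import random

# Monte Carlo check of the switch vs. no-switch strategies.
def play(switch, trials=100_000, rng=random.Random(0)):
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)    # where the car actually is
        pick = rng.randrange(3)   # our initial, information-less choice
        # The host, who knows where the car is, opens a losing door we didn't pick.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```

If the host instead opened a random door, the simulated winning rates for the two strategies would coincide, matching the no-prior-knowledge variant discussed below.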

When we draw the diagram in Figure 11-4 to represent this game, we realize that the door that the host chooses to open has two parents pointing toward it: the door you chose and the location of the car.

Figure 11-4. Causal diagram of the variables involved in the Monty Hall paradox

Conditioning on this collider changes the probabilities of the parents. It creates a spurious dependency between originally independent parents! This is similar to us changing our beliefs about the genetic traits of parents once we meet one of their children. These are causeless correlations, induced when we condition on colliders.

Now suppose that the host chooses their door without knowing whether it is a winning or a losing door. Then switching or nonswitching would not change the odds of winning the car, because in this case both you and the host have equal chances of winning 1/3 of the time and losing 2/3 of the time. Now when we draw the diagram for this totally random and no-prior-knowledge game, there is no arrow between the location of the car and the door that the host chooses to open, so your choice of the door and the location of the car remain independent even after conditioning on the host’s choice.

Berkson’s Paradox

In 1946, Joseph Berkson, a biostatistician at the Mayo Clinic, pointed out a peculiarity of observational studies conducted in a hospital setting: even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital. In 1979, David Sackett of McMaster University, an expert on all sorts of statistical bias, provided strong evidence that Berkson’s paradox is real. In one example, he studied two groups of diseases: respiratory and bone. About 7.5% of people in the general population have a bone disease, and this percentage is independent of whether they have respiratory disease. But for hospitalized people with respiratory disease, the frequency of bone disease jumps to 25%! Sackett called this phenomenon “admission rate bias” or “Berkson bias.”

Similar to the Monty Hall case, the culprit for the appearance of the Berkson paradox is a collider diagram, where both originally independent diseases point to hospitalization: a patient with both diseases is much more likely to be hospitalized than a patient with only one of them. When we condition on hospitalization, which is the collider, a case of causeless correlation between the initially independent variables appears. We are getting used to collider bias now.
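
Collider bias is easy to reproduce in a small simulation. The disease rates and hospitalization probabilities below are illustrative choices, tuned only to roughly echo Sackett's numbers, not his data:

```python
import random

# Berkson's bias in simulation: two diseases, independent in the general
# population, become associated once we condition on hospitalization (the
# collider). All rates below are made-up illustrative values.
rng = random.Random(0)
people = []
for _ in range(200_000):
    bone = rng.random() < 0.075   # bone disease, 7.5% of the population
    resp = rng.random() < 0.05    # respiratory disease, independent of bone
    # Either disease can cause hospitalization; both make it much likelier.
    p_hospital = 0.8 if (bone and resp) else 0.2 if (bone or resp) else 0.01
    people.append((bone, resp, rng.random() < p_hospital))

def bone_rate(group):
    group = list(group)
    return sum(bone for bone, _, _ in group) / len(group)

print(bone_rate(people))                              # about 0.075 overall
print(bone_rate(p for p in people if p[1] and p[2]))  # jumps among hospitalized
```

The jump appears purely because we conditioned on the collider; no causal link between the two diseases was ever put into the simulation.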

Simpson’s Paradox

Imagine a paradox whose conclusion, if left to its own devices, is this absurd: when we know the gender of the patient, then we should not prescribe the drug, because the data shows that the drug is bad for males and bad for females; but if the gender is unknown, then we should prescribe the drug, because the data shows that the drug is good for the general population. This is obviously ridiculous, and our first instinct should be to protest: show me the data!

We recognize Simpson’s paradox when a trend appears in several groups of the population but disappears or reverses when the groups are combined.

Let’s first debunk the paradox. It is a simple numerical mistake of how to add fractions (or proportions). In summary, when we add fractions, we cannot simply add the respective numerators and the denominators:

A/B > a/b and C/D > c/d does not imply (A + C)/(B + D) > (a + c)/(b + d)

For example, suppose that the data shows that:

  • 3/40 of the women who took the drug had a heart attack, compared to only 1/20 of the women who did not take the drug (3/40 > 1/20).

  • 8/20 of the men who took the drug had a heart attack, compared to 12/40 of the men who did not take the drug (8/20 > 12/40).

Now when we merge the data for women and men, the inequality reverses direction: 3/40 > 1/20 and 8/20 > 12/40, but rightfully (3 + 8)/(40 + 20) < (1 + 12)/(20 + 40). In other words: of the 60 men and women who took the drug, 11 had a heart attack, and of the 60 men and women who did not take the drug, 13 had a heart attack.

However, we committed a simple mistake with fractions when we merged the data this way. To solve Simpson’s paradox, we should not merge the data by simply adding the numerators and the denominators and expect the inequality to hold. Note that of the 60 people who took the drug, 40 are women and 20 are men; while of the 60 people who did not take the drug, 20 are women and 40 are men. We are comparing apples and oranges and confounding that with gender. The gender affects both whether the drug is administered and whether a heart attack happens. The diagram in Figure 11-5 illustrates this confounder relationship.

Figure 11-5. Gender is a confounder of taking the drug and having a heart attack

Our strong intuition that something is wrong if we merge proportions naively is spot on. If things are fair on a local level everywhere, then they are fair globally; or if things act a certain way on every local level, then we should expect them to act that way globally.

It is no surprise that this mistake happens so often, as humans did not get fractions right until relatively recently. There are ancient texts with mistakes manipulating fractions in domains such as inheritance and trade. Our brains’ resistance to fractions seems to persist: we learn fractions in seventh grade, and that also happens to be the time to which we can trace the origin of many people’s legendary hatred for math.

So what is the correct way to merge the data? Our grade-seven wisdom tells us to use the common denominator 40 and to condition on gender: for women 3/40 > 2/40 and for men 16/40 > 12/40. Now since in the general population men and women are equally distributed, we should take the average and rightfully conclude that (3/40 + 16/40)/2 > (2/40 + 12/40)/2; that is, the rate of heart attacks in the general population is 23.75% with the drug, and 17.5% without the drug. No magical and illogical reversal happened here. Moreover, this drug is pretty bad!
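
The whole paradox, and its resolution, fits in a few lines of arithmetic using the numbers from the text:

```python
# The drug/heart-attack data from the text:
# (heart attacks, group size) per gender and treatment group.
data = {
    "women": {"drug": (3, 40), "no_drug": (1, 20)},
    "men":   {"drug": (8, 20), "no_drug": (12, 40)},
}

def rate(attacks, size):
    return attacks / size

# Within each gender the drug looks bad (higher heart attack rate):
assert rate(*data["women"]["drug"]) > rate(*data["women"]["no_drug"])
assert rate(*data["men"]["drug"]) > rate(*data["men"]["no_drug"])

# Naively pooling numerators and denominators reverses the trend:
pooled_drug, pooled_no_drug = rate(3 + 8, 40 + 20), rate(1 + 12, 20 + 40)
print(pooled_drug, "<", pooled_no_drug)  # 11/60 < 13/60: drug "looks" good

# Conditioning on gender and averaging (genders are equally distributed
# in the general population) removes the illusion:
adjusted_drug = (rate(3, 40) + rate(8, 20)) / 2      # 0.2375
adjusted_no_drug = (rate(1, 20) + rate(12, 40)) / 2  # 0.175
print(adjusted_drug, ">", adjusted_no_drug)          # the drug is in fact bad
```

The adjusted average is exactly the adjustment formula from the do calculus, with gender playing the role of the confounder we condition on.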

Large Random Matrices

Most AI applications deal with a vast amount of high-dimensional data (big data), organized in high-dimensional vectors, matrices, or tensors, representing data tables, images, natural language, graph networks, and others. A lot of this data is noisy or has an intrinsic random nature. To process such data, we need a mathematical framework that combines probability and statistics, which usually deal with scalar random variables, with linear algebra, which deals with vectors and matrices.

The mean and variance are still central ideas, so we find many statements and results containing the expectation and variance (uncertainty) of the involved high-dimensional random variables. Similar to the scalar case, the tricky part is controlling the variance, so a lot of work in the literature finds bounds (inequalities) on the tails of the random variables' distributions, or on how likely it is to find a random variable within some distance of its mean.

Since we now have matrix-valued random variables, many results seek to understand the behaviors (distributions) of their spectra: eigenvalues and eigenvectors.

Examples of Random Vectors and Random Matrices

It is no wonder the study of large random matrices evolved into its own theory. They appear in all sorts of impactful applications, from finance to neuroscience to physics and the manufacture of technological devices. The following is only a sampling of examples. These have great implications, so there are large mathematical communities around each of them.

Quantitative finance

One example of a random vector is an investment portfolio in quantitative finance. We often need to decide on how to invest in a large number of stocks, whose price movement is stochastic, for optimal performance. The investment portfolio itself is a large random vector that evolves with time. In the same spirit, the daily returns of Nasdaq stocks (Nasdaq contains more than 2,500 stocks) is a time-evolving large random vector.

Neuroscience

Another example is from neuroscience. Random matrices appear when modeling a network of synaptic connections between neurons in the brain. The number of spikes fired by n neurons during t consecutive time intervals of a certain length is an n × t random matrix.

Mathematical physics: Wigner matrices

In mathematical physics, particularly in nuclear physics, physicist Eugene Wigner introduced random matrices to model the nuclei of heavy atoms and their spectra. In a nutshell, he related the spacings between the lines in the spectrum of a heavy atom’s nucleus to the spacings between the eigenvalues of a random matrix.

The deterministic matrix that Wigner started with is the Hamiltonian of the system, which is a matrix describing all the interactions between the neutrons and protons contained in the nucleus. The task of diagonalizing the Hamiltonian to find the energy levels of the nucleus was impossible, so Wigner looked for an alternative. He abandoned exactness and determinism altogether and approached the question from a probabilistic perspective. Instead of asking what precisely are the energy levels, he asked questions like:

  • What is the probability of finding an energy level within a certain interval?

  • What is the probability that the distance between two successive energy levels is within a certain range?

  • Can we replace the Hamiltonian of the system by a purely random matrix with the correct symmetry properties? For example, in the case of quantum systems invariant under time reversal, the Hamiltonian is a real symmetric matrix (of infinite size). In the presence of a magnetic field, the Hamiltonian is a complex, Hermitian matrix (the complex analog of a real symmetric matrix). In the presence of spin-orbit coupling (a quantum physics term), the Hamiltonian is symplectic (another special type of symmetric matrix).

Similarly, Wigner-type random matrices appear in condensed matter physics, where we model the interaction between pairs of atoms or pairs of spins using real symmetric Wigner matrices. Overall, Wigner matrices are considered classical in random matrix theory.

Multivariate statistics: Wishart matrices and covariance

In multivariate statistics, John Wishart introduced random matrices when he wanted to estimate sample covariance matrices of large random vectors. Wishart random matrices are also considered classical in random matrix theory. Note that a sample covariance matrix is an estimation for the population covariance matrix.

When dealing with sample covariance matrices, a common setting is that of n dimensional variables observed t times, that is, the original data set is a matrix of size n × t . For example, we might need to estimate the covariance matrix of the returns of a large number of assets (using a smaller sample), such as the daily returns of the 2,500 Nasdaq stocks. If we use 5 years of daily data, given that there are 252 trading days in a year, then we have 5 × 252 = 1,260 data points for each of the 2,500 stocks. The original data set would be a matrix of size 2,500 × 1,260. This is a case where the number of observations is smaller than the number of variables. We have other cases where it is the other way around, as well as limiting cases where the number of observations and the number of variables are of drastically different scales. In all cases, we are interested in the law (probability distribution) for the eigenvalues of the sample covariance matrix.

Let’s write the formulas for the entries of the covariance matrix. For one variable z_1 (say one stock) with t observations whose mean (average) is z̄_1, we have the variance:

σ_1² = [ (z_1(1) − z̄_1)² + (z_1(2) − z̄_1)² + ⋯ + (z_1(t) − z̄_1)² ] / t

Similarly, for each of the n variables z_i, we have their variance σ_i². These sit on the diagonal of the covariance matrix. Now each off-diagonal entry σ_ij is the covariance of the corresponding pair of variables:

σ_ij = [ (z_i(1) − z̄_i)(z_j(1) − z̄_j) + (z_i(2) − z̄_i)(z_j(2) − z̄_j) + ⋯ + (z_i(t) − z̄_i)(z_j(t) − z̄_j) ] / t

The covariance matrix is symmetric and positive semidefinite (it has nonnegative eigenvalues). The randomness in a covariance matrix usually stems from noisy observations. Since measurement noise is inevitable, determining the covariance matrix becomes more involved mathematically. Another common issue is that often the samples are not independent. Correlated samples introduce some sort of redundancy, so we expect the sample covariance matrix to behave as if we had observed fewer samples than we actually did. We must then analyze the sample covariance matrix in the presence of correlated samples.
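The entrywise formulas can be checked against NumPy's estimator. A small sketch with hypothetical n × t data (the 1/t normalization corresponds to `bias=True` in `np.cov`):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 1000          # n variables observed t times (toy stand-in for stocks)
Z = rng.normal(size=(n, t))

# Entry-by-entry covariance, following the chapter's formulas:
# sigma_ij = sum_k (z_i(k) - zbar_i)(z_j(k) - zbar_j) / t
zbar = Z.mean(axis=1, keepdims=True)
C_manual = (Z - zbar) @ (Z - zbar).T / t

# NumPy's estimator with bias=True uses the same 1/t normalization
C_numpy = np.cov(Z, bias=True)
assert np.allclose(C_manual, C_numpy)

# Symmetric, with nonnegative eigenvalues
assert np.allclose(C_manual, C_manual.T)
assert np.all(np.linalg.eigvalsh(C_manual) >= -1e-12)
```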

Dynamical systems

Dynamical systems linearized near equilibria take the form dx(t)/dt = A x(t). In the context of chaotic systems, we want to understand how a small difference in initial conditions propagates as the dynamics unfolds. One approach is to linearize the dynamics in the vicinity of the unperturbed trajectory. The perturbation then evolves as a product of matrices, corresponding to the linearized dynamics, applied to the initial perturbation.

Algorithms for Matrix Multiplication

Finding efficient algorithms for matrix multiplication is an essential, yet surprisingly difficult, goal. In matrix multiplication algorithms, saving even one multiplication operation is worthwhile (saving on additions matters much less). Recently, DeepMind developed AlphaTensor (2022) to automatically discover more efficient algorithms for matrix multiplication. This is a milestone because matrix multiplication is a fundamental part of a vast array of technologies, including neural networks, computer graphics, and scientific computing.
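The classic illustration of why one saved multiplication matters is Strassen's 1969 scheme, which multiplies two 2 × 2 matrices with 7 multiplications instead of 8; applied recursively to blocks, the saving compounds (AlphaTensor searches for schemes of exactly this kind). A minimal sketch, not taken from the book:

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)
    instead of the naive 8. The entries could themselves be matrix blocks,
    which is where the recursive savings come from."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert np.array_equal(strassen_2x2(A, B), A @ B)  # [[19, 22], [43, 50]]
```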

Other equally important examples

There are other examples. In number theory, we can model the distribution of zeros of the Riemann zeta function using the distribution of eigenvalues of certain random matrices. For those keeping an eye out for quantum computing, here’s a historical note: before Schrödinger’s equation, Heisenberg formulated quantum mechanics in terms of what he named matrix mechanics. Finally, we will encounter the master equation for the evolution of probabilities in Chapter 13. This involves a large matrix of transition probabilities from one state of a system to another state.

Main Considerations in Random Matrix Theory

Depending on the formulation of the problem, the matrices that appear are either deterministic or random. For deterministic vectors and matrices, classical numerical linear algebra applies, but the extreme high dimensionality forces us to use randomization to efficiently perform matrix multiplication (usually O(n³)), decomposition, and computation of the spectrum (eigenvalues and eigenvectors).

A substantial portion of a matrix's properties is encapsulated in its spectrum, so we learn a great deal about matrices by studying their eigenvalues and eigenvectors. In the stochastic realm, when the matrices are random, these are random as well. So how do we compute them and find their probability distributions (or even only their means and variances, or bounds on those)? These are the types of questions that the field of large random matrices (or randomized linear algebra, or high-dimensional probability) addresses. We usually focus on:

The involved stochastic math objects

Random vectors and random matrices. Each entry of a random vector or a random matrix is a random variable. These could either be static random variables or evolving with time. When a random variable evolves with time, it becomes a random or stochastic process. Obviously, stochastic processes are more involved mathematically than their static counterparts. For example, what can we say about variances that evolve with time?

Random projections

Our interest is always in projecting onto some lower-dimensional spaces while preserving essential information. These usually involve either multiplying matrices with vectors or factorizing matrices into a product of simpler matrices, such as the singular value decomposition. How do we do these when the data is large and the entries are random?
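One concrete way to project onto a lower-dimensional space while approximately preserving information is a Gaussian random projection (the Johnson-Lindenstrauss idea). A sketch with hypothetical sizes and seed, not from the book:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 10_000, 400   # n points in d dimensions, projected down to k
X = rng.normal(size=(n, d))

# Gaussian random projection: entries N(0, 1/k), so squared norms are
# preserved in expectation (Johnson-Lindenstrauss flavor)
P = rng.normal(scale=1 / np.sqrt(k), size=(d, k))
Y = X @ P

# A pairwise distance survives the 25x dimensionality reduction, approximately
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
assert abs(proj / orig - 1) < 0.25
```

The point of the design: no data-dependent training is needed, and the distortion concentrates around zero at rate roughly 1/√k.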

Adding and multiplying random matrices

Note that the sums and products of scalar random variables are also random variables, and their distributions are well studied. Similarly, the sums and products of time-evolving scalar random variables, which are the foundation for Brownian motion and stochastic calculus, have a large body of literature supporting them. How does this theory transition to higher dimensions?

Computing the spectra

How do we compute the spectrum of a random matrix, and explore the properties of its (random) eigenvalues and eigenvectors?

Computing the spectra of sums and products of random matrices

How do we do the same for the spectra of sums and products of random matrices?

Multiplying many random matrices, as opposed to only two

This problem appears in many contexts in the technological industry, for example, when studying the transmission of light in a succession of slabs of different optical indices, or the propagation of an electron in a disordered wire, or the way displacements propagate in granular media.

Bayesian estimation for matrices

Bayesian anything always has to do with estimating the probability of something given some evidence. Here, the matrix we start with (the observations matrix) is a noisy version of the true matrix that we care about. The noise can be additive, so the observed matrix E = true matrix + a random noise matrix. The noise can also be multiplicative, so the observed matrix E = true matrix × random noise matrix. In general, we do not know the true matrix, and would like to know the probability of this matrix given that we have observed the noisy matrix. That is, we have to compute Prob(true matrix|noisy matrix).

Random Matrix Ensembles

In most applications, we encounter (stochastic or deterministic) large matrices with no particular structure. The main premise that underlies random matrix theory is that we can replace such a large complex matrix by a typical element (expected element) of a certain ensemble of random matrices. Most of the time we restrict our attention to symmetric matrices with real entries, since these are the ones that most commonly arise in data analysis and statistical physics. Thankfully, these are easier to analyze mathematically.

Speaking of mathematics, we love polynomial functions. They are nonlinear, complex enough to capture enough of the complexity of the world around us, and easy to evaluate and compute with. When we study large random matrices, a special type of well-studied polynomial appears: orthogonal polynomials. An orthogonal polynomial sequence is a family of polynomials in which any two distinct polynomials are orthogonal to each other (their inner product is zero) under some inner product (a generalized dot product). The most widely used orthogonal polynomial sequences are the Hermite polynomials, the Laguerre polynomials, and the Jacobi polynomials (the latter include the important classes of Chebyshev polynomials and Legendre polynomials). The famous names in the field of orthogonal polynomials, which was mostly developed in the late 19th century, are Chebyshev, Markov, and Stieltjes. No wonder these names are everywhere in probability theory, from Chebyshev inequalities to Markov chains and processes to Stieltjes transforms.
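Orthogonality under a weighted inner product is easy to verify numerically. A brute-force sketch (my own illustration, not from the book) for the probabilists' Hermite polynomials He_n, which are orthogonal under the Gaussian weight exp(-x²/2), using NumPy's `hermite_e` module:

```python
import numpy as np
from numpy.polynomial.hermite_e import HermiteE

# Fine grid and Gaussian weight for a brute-force inner product
x = np.linspace(-12, 12, 200_001)
dx = x[1] - x[0]
w = np.exp(-x**2 / 2)

def He(n):
    # HermiteE with a single coefficient 1 in position n evaluates He_n
    return HermiteE([0] * n + [1])(x)

def inner(f, g):
    # <f, g> = integral of f(x) g(x) exp(-x^2/2) dx, via a Riemann sum
    return np.sum(f * g * w) * dx

# Distinct degrees: the inner product vanishes
assert abs(inner(He(2), He(3))) < 1e-6
# Same degree n: the inner product equals n! * sqrt(2*pi); here n = 3
val = inner(He(3), He(3))
assert abs(val - 6 * np.sqrt(2 * np.pi)) < 1e-3
```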

The following three fundamental random matrix ensembles are intimately related to orthogonal polynomials:

Wigner

This is the matrix equivalent of the Gaussian distribution. A 1 × 1 Wigner matrix is a single Gaussian random number. This ensemble is intimately related to the Hermite orthogonal polynomials. The Gaussian distribution and its associated Hermite polynomials appear very naturally in contexts where the underlying variable is unbounded above and below. The averages of the characteristic polynomials of Wigner random matrices obey simple recursion relations that allow us to express them as Hermite polynomials. The Wigner ensemble is the simplest of all ensembles of random matrices. These are matrices whose elements are all Gaussian random variables, with the only constraint being that the matrix is real symmetric (the Gaussian orthogonal ensemble), complex Hermitian (the Gaussian unitary ensemble), or symplectic (the Gaussian symplectic ensemble).
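Sampling from the Gaussian orthogonal ensemble takes two lines, and the eigenvalues then follow Wigner's famous semicircle law on [−2, 2] (with the scaling below). A quick numerical illustration with a hypothetical size and seed:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Gaussian orthogonal ensemble: symmetrize an i.i.d. Gaussian matrix;
# the 1/sqrt(2n) scaling puts the limiting spectrum on [-2, 2]
G = rng.normal(size=(n, n))
W = (G + G.T) / np.sqrt(2 * n)

eigs = np.linalg.eigvalsh(W)

# Wigner's semicircle law: density (1 / (2*pi)) * sqrt(4 - x^2) on [-2, 2].
# Check the support and the symmetry of the empirical spectrum.
assert eigs.min() > -2.2 and eigs.max() < 2.2
assert abs(eigs.mean()) < 0.1
assert abs((eigs > 0).mean() - 0.5) < 0.05  # half the mass on each side
```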

Wishart

This is the matrix equivalent of the gamma distribution. A 1 × 1 Wishart matrix is a gamma-distributed number. This ensemble is intimately related to the Laguerre orthogonal polynomials. Gamma distributions and Laguerre polynomials appear in problems where the variable is bounded from below (e.g., positive variables). The averages of the characteristic polynomials of Wishart random matrices obey simple recursion relations that allow us to express them as Laguerre polynomials.
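The 1 × 1 case is easy to check by simulation: the square of a standard Gaussian is chi-squared with one degree of freedom, which is gamma(shape = 1/2, scale = 2), with mean 1 and variance 2. A sketch with a hypothetical sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
t = 200_000

# A 1x1 Wishart sample: the square of a standard Gaussian is
# chi-squared(1) = gamma(shape=1/2, scale=2)
x = rng.normal(size=t)
samples = x**2

# gamma(1/2, 2): mean = (1/2) * 2 = 1, variance = (1/2) * 2^2 = 2
assert abs(samples.mean() - 1) < 0.02
assert abs(samples.var() - 2) < 0.1
```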

Jacobi

This is the matrix equivalent of the beta distribution. A 1 × 1 Jacobi matrix is a beta-distributed number. This ensemble is intimately related to the Jacobi orthogonal polynomials. Beta distributions and Jacobi polynomials appear in problems where the variable is bounded from above and from below. A natural setting where Jacobi matrices appear is that of sample covariance matrices. They also show up in the simple problem of addition or multiplication of matrices with only two eigenvalues.

As with scalar random variables, we study the moments and the Stieltjes transform of random matrix ensembles. Moreover, since we are in the matrix realm, we study the joint probability distributions of the eigenvalues of these random matrices. For the ensembles mentioned previously, the eigenvalues are strongly correlated, and we can think of them as particles interacting through pairwise repulsion. These are called Coulomb repelling eigenvalues, and the idea here is borrowed from statistical physics (see, for example, “Patterns in Eigenvalues” by Persi Diaconis (2003) for a deeper dive into the behavior of the eigenvalues of matrices with special structures). It turns out that the most probable positions of the Coulomb gas problem coincide with the zeros of the Hermite polynomials in the Wigner case, and of the Laguerre polynomials in the Wishart case. Moreover, the eigenvalues of these ensembles fluctuate very little around their most probable positions.

Eigenvalue Density of the Sum of Two Large Random Matrices

Other than finding the joint probability distribution of the eigenvalues of random matrix ensembles, we care about the eigenvalue density (probability distribution) of sums of large random matrices, in terms of each individual matrix in the sum. Dyson Brownian motion appears in this context. It is an extension of Brownian motion from scalar random variables to random matrices. Moreover, a Fourier transform for matrices allows us to define the analog of the generating function for scalar independent and identically distributed random variables, and use its logarithm to find the eigenvalue density of sums of carefully constructed random matrices. Finally, we can apply Chernoff, Bernstein, and Hoeffding-type inequalities to the maximal eigenvalue of a finite sum of random Hermitian matrices.

Essential Math for Large Random Matrices

Before leaving the discussion of large random matrices, let’s highlight the must-knows if we want to dive deeply into this field. We touch on some of these in this chapter and leave the rest for your googling skills:

  • Computing the spectrum: eigenvalues and eigenvectors of a matrix (solutions of A v = λ v )

  • Characteristic polynomial of a matrix ( det(λI − A) )

  • Hermite, Laguerre, and Jacobi orthogonal polynomials

  • The Gaussian, gamma, and beta probability distributions

  • Moments and moment generating function of a random variable

  • Stieltjes transform

  • Chebyshev everything

  • Markov everything

  • Chernoff, Bernstein, and Hoeffding-type inequalities

  • Brownian motion and Dyson Brownian motion

As of 2022, the fastest supercomputer is Frontier, the world’s first exascale computer (1.102 exaFLOPS), at the Department of Energy’s Oak Ridge National Laboratory. When matrices are very large, even on such a supercomputer, we cannot apply numerical linear algebra as we know it (for example, to solve systems of equations involving the matrix, to find its spectrum, or to find its singular value decomposition). What we must do instead is randomly sample the columns of the matrix. It is best to sample the columns with a probability that leads to the most faithful approximation, the one with the least variance. For example, if the problem is to multiply two large matrices A and B, instead of sampling a column from A and a corresponding row from B uniformly, we choose column j from A and the corresponding row j from B with a probability p_j proportional to norm(column j of A) × norm(row j of B). This means that we choose the columns and rows with large norms more often, leading to a higher probability of capturing the important parts of the product.
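The norm-based sampling scheme fits in a few lines of NumPy. A sketch with hypothetical (and deliberately small) sizes; in practice the point is that s can be far smaller than the matrix dimension:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

# Norm-based sampling probabilities:
# p_j proportional to norm(column j of A) * norm(row j of B)
p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = p / p.sum()

# Sample s column/row outer products; rescaling each by 1/(s * p_j)
# makes the estimator unbiased: E[approx] = A @ B
s = 2000
idx = rng.choice(n, size=s, p=p)
approx = sum(np.outer(A[:, j], B[j, :]) / (s * p[j]) for j in idx)

rel_err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
assert rel_err < 0.5  # relative Frobenius error shrinks like sqrt(n / s)
```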

The column space of a matrix, whether large or small, is very important. Keep in mind the three best bases for the column space of a given matrix A:

  • The singular vectors from the singular value decomposition

  • The orthogonal vectors from the Gram-Schmidt process (the famous QR decomposition of the matrix)

  • Linearly independent columns directly selected from the columns of A
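The three bases can be compared side by side. A short NumPy sketch using a hypothetical rank-3 matrix (each basis spans the same column space, so projecting A onto it recovers A):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 5))  # 6x5 matrix of rank 3

r = np.linalg.matrix_rank(A)
assert r == 3

# 1. Left singular vectors from the SVD span the column space
U, S, Vt = np.linalg.svd(A)
basis_svd = U[:, :r]

# 2. Orthonormal vectors from Gram-Schmidt (the QR decomposition)
Q, R = np.linalg.qr(A)
basis_qr = Q[:, :r]

# 3. r linearly independent columns of A itself (the first r happen to be
# independent for this randomly generated A)
basis_cols = A[:, :r]

# All three span the same subspace: the projector onto each basis fixes A
for Bmat in (basis_svd, basis_qr, basis_cols):
    P = Bmat @ np.linalg.pinv(Bmat)  # orthogonal projector onto span(Bmat)
    assert np.allclose(P @ A, A)
```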

Stochastic Processes

Rather than thinking of a static (scalar or vector or matrix or tensor) random variable, we now think of a time-dependent random variable. Somehow the next step in math always ends up including time-evolving entities. On a side note, humans have not yet fully understood the nature of time or found a way to articulate its definition. We do, however, understand movement and change, a system transitioning from one state to another, and we associate time with that. We also associate probabilities for transitioning from one state to another. Hold this thought for Markov chains, coming up in a bit.

A stochastic process is an infinite sequence X 0 , X 1 , X 2 , . . . of random variables, where we think of the index t in each X t as discrete time. So X 0 is the process at time 0 (or the value of a random quantity at a certain time 0), X 1 is the process at time 1 (or the value of a random quantity at a certain time 1), and so on. To formally define a random variable, which we have not done yet, we usually fix it over what we call a probability triple (a sample space, a sigma algebra, a probability measure). Do not fuss about the meaning of this triple yet; instead, fuss about the fact that all the random variables in one stochastic process X 0 , X 1 , X 2 , . . . live over the same probability triple, in this sense belonging to one family. Moreover, these random variables are often not independent.

Equally important (depending on the application) is a continuous time stochastic process, where X t now encodes the value of a random quantity at any nonnegative time t. Moreover, this is easy to align with our intuitive perception of time, which is continuous.

Therefore, a stochastic process is a generalization of a finite dimensional multivariable distribution to infinite dimensions. This way of thinking about it comes in handy when trying to prove the existence of a stochastic process, since then we can resort to theorems that allow us to extend to infinite dimensions by relying on finite dimensional collections of the constituent distributions.

Examples of stochastic processes are all around us. We think of these whenever we encounter fluctuations: movement of gas molecules, electrical current fluctuations, stock prices in financial markets, the number of phone calls to a call center in a certain time period, or a gambler’s process. And here’s an interesting finding from a microbiology paper discussing bacteria found in the gut: the community assembly of blood, fleas, or torsalos is primarily governed by stochastic processes, while the gut microbiome is determined by deterministic processes.

The stock market example is central in the theory of stochastic processes, because it is how Brownian motion (also called the Wiener stochastic process) got popularized, with L. Bachelier studying price changes in the Paris Bourse. The example of phone calls to a call center is also central in the theory, because it is how the Poisson stochastic process got popularized, with A. K. Erlang modeling the number of phone calls occurring in a certain period of time.

These two processes, Brownian and Poisson, appear in many other settings that have nothing to do with the previously mentioned examples. Maybe this tells us something deeper about nature and the unity of its underlying processes, but let’s not get philosophical and stay with mathematics. In general, we can group stochastic processes into a few categories, depending on their mathematical properties. Some of these are discrete-time processes, and others are continuous-time. The distinction between the two is pretty intuitive.

To derive conclusions about Brownian and Poisson processes, and the other stochastic processes that we are about to overview, we need to analyze them mathematically. In probability theory, we start with establishing the existence of a stochastic process. That is, we need to explicitly define the probability triple (sample space, sigma algebra, probability measure) where the discrete time infinite sequence of random variables X 0 , X 1 , X 2 , . . . or the continuous time X t process lives, and prove that we can find such a set of random variables satisfying its characterizing properties. We will revisit this later in this chapter, but a big name we want to search for when proving the existence of stochastic processes is A. Kolmogorov (1903–1987), namely, the Kolmogorov existence theorem. This ensures the existence of a stochastic process having the same finite dimensional distributions as our desired processes. That is, we can get our desired stochastic process (infinite process, indexed on discrete or continuous time) by specifying all finite dimensional distributions in some consistent way.

Let’s survey the most prominent stochastic processes.

Bernoulli Process

This is the stochastic process mostly associated with repeatedly flipping a coin and any process in life that mimics that (at some airports, customs officials make us press a button: if the light turns green we pass, if it turns red we get searched). Mathematically, it is an infinite sequence of independent and identically distributed random variables X_0, X_1, X_2, . . . , where each random variable takes the value 0 with probability p or the value 1 with probability 1 − p. A sample realization of this process would look like 0, 1, 1, 0, . . . .
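Simulating a (finite stretch of a) Bernoulli process is a one-liner. A sketch with a hypothetical p, using the chapter's convention that 0 occurs with probability p and 1 with probability 1 − p:

```python
import numpy as np

rng = np.random.default_rng(6)
p = 0.3        # probability of a 0 (so 1 occurs with probability 1 - p)
n = 100_000    # length of the simulated sample path

# A sample path of a Bernoulli process: i.i.d. 0/1 random variables
path = (rng.random(n) >= p).astype(int)   # 1 with probability 1 - p

assert set(np.unique(path)) <= {0, 1}
assert abs(path.mean() - (1 - p)) < 0.01  # law of large numbers in action
```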

泊松过程

Poisson Process

我们可以把泊松过程看作一个随机过程,其底层随机变量是计数变量:它们计数在设定时间段内发生了多少个我们关心的事件。这些事件要么相互独立,要么弱相关,并且每个事件发生的概率都很小。它们以设定的期望速率 λ 发生,这就是刻画泊松随机变量的参数。例如,在排队论中,我们用它来建模顾客到达商店、电话打入呼叫中心,或某个时间区间内地震的发生。该过程以自然数为状态空间,以非负数为索引集。泊松过程所涉及随机变量的概率分布具有以下公式:

We can think of the Poisson process as a stochastic process whose underlying random variables are counting variables. These count how many interesting events happen within a set period of time. These events are either independent or are weakly dependent, and each has a small probability of occurrence. They also happen at a set expected rate λ . This is the parameter that characterizes a Poisson random variable. For example, in queuing theory, we use it to model the arrival of customers at a store, phone calls at a call center, or the occurrence of earthquakes in a certain time interval. This process has the natural numbers as its state space and the nonnegative numbers as its index set. The probability distribution underlying the random variables involved in a Poisson process has the formula:

P(X = n) = λ^n e^{-λ} / n!

该公式给出了单位时间内发生 n 个我们关心的事件的概率。显然,在固定的时间区间内,许多罕见事件不太可能同时发生,这解释了公式在 n 很大时的快速衰减。泊松随机变量的期望和方差都等于 λ。

The formula gives the probability of n interesting events occurring in a unit period of time. Clearly, within a fixed time interval it is not likely for many rare events to occur, which explains the rapid decay of the formula for large n. The expectation and the variance of a Poisson random variable are both equal to λ.
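To make the formula concrete, the following plain-Python sketch (helper name is ours) evaluates P(X = n) = λ^n e^{-λ} / n! and numerically checks that the pmf sums to 1 and that its mean and variance both come out to λ:

```python
import math

def poisson_pmf(n, lam):
    """P(X = n) = lam^n * e^(-lam) / n!"""
    return lam ** n * math.exp(-lam) / math.factorial(n)

lam = 3.0
support = range(100)  # truncated support; the tail beyond 100 is negligible
total = sum(poisson_pmf(n, lam) for n in support)
mean = sum(n * poisson_pmf(n, lam) for n in support)
var = sum((n - mean) ** 2 * poisson_pmf(n, lam) for n in support)
```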

泊松过程 {X_t ; t ≥ 0},按连续时间索引,具有以下性质:

A Poisson process {X_t ; t ≥ 0}, indexed by continuous time, has the properties:

  • X 0 = 0

  • X 0 = 0

  • 任何长度为 t 的区间内事件(或点)的数量,是参数为 λt 的泊松随机变量

  • The number of events (or points) in any interval of length t is a Poisson random variable with parameter λ t

泊松过程有两个重要特征:

A Poisson process has two important features:

  • 每个有限区间内的事件数是泊松随机变量(具有泊松概率分布)。

  • The number of events in each finite interval is a Poisson random variable (has a Poisson probability distribution).

  • 不相交时间间隔中的事件数量是独立的随机变量。

  • The number of events in disjoint time intervals are independent random variables.

泊松过程是 Levy 随机过程的一个例子,后者是具有平稳独立增量的过程。

The Poisson process is an example of a Levy stochastic process, which is a process with stationary independent increments.

随机游走

Random Walk

最简单的随机游走很容易这样理解:某人在一条路上从某处出发,然后以概率 p 向前迈一步(位置加一),以概率 1 − p 向后迈一步(位置减一)。我们可以定义由此产生的离散时间随机过程 X_0, X_1, ...,使得 X_0 = x_0,X_1 = X_0 + Z_1,X_2 = X_1 + Z_2 = X_0 + Z_1 + Z_2,依此类推,其中 Z_1, Z_2, ... 是伯努利过程(这里取值 ±1)。如果 p = 0.5,则这是对称随机游走。

It is easy to think of the simplest random walk as someone taking steps on a road where they start somewhere, then move forward (add one to their position) with probability p, and backward (subtract one from their position) with probability 1 – p. We can define the resulting discrete time stochastic process X_0, X_1, ... such that X_0 = x_0, X_1 = X_0 + Z_1, X_2 = X_1 + Z_2 = X_0 + Z_1 + Z_2, etc., where Z_1, Z_2, ... is a Bernoulli process (here with values ±1). If p = 0.5, this is a symmetric random walk.
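A minimal sketch of this walk (function name is illustrative): start at x_0, then repeatedly add +1 with probability p and −1 otherwise.

```python
import random

def random_walk(p, steps, x0=0, seed=None):
    """Simple random walk: from x0, move +1 with probability p
    and -1 with probability 1 - p at each step."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        path.append(path[-1] + (1 if rng.random() < p else -1))
    return path

path = random_walk(p=0.5, steps=20, seed=7)  # a symmetric random walk
```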

在第 9 章中,我们多次使用图上的随机游走:从某个图节点出发,然后以给定的概率转移到其相邻节点之一。图的归一化邻接矩阵定义了所有节点处的转移概率。这是图上随机游走与马尔可夫链(稍后介绍)联系起来的一个很好的例子。想进一步了解,请查看这组关于图上随机游走的精彩笔记。

In Chapter 9, we used random walks on graphs multiple times, where we start at a certain graph node and then transition to one of its adjacent nodes with given probabilities. A normalized adjacency matrix of the graph would define the transition probabilities at all nodes. This is a neat example of how random walks on graphs tie to Markov chains, coming up soon. For more on this, check this nice set of notes on random walks on graphs.

维纳过程或布朗运动

Wiener Process or Brownian Motion

我们可以将维纳过程(布朗运动)视为步长无穷小的随机游走:离散的移动变成无穷小的波动,我们便得到连续的随机游走。布朗运动是连续时间随机过程 {X_t ; t ≥ 0}。随机变量 X_t 取实值,具有独立增量,并且在两个不同时刻 t 和 s 之间的差 X_t − X_s 服从正态分布(高斯钟形分布),均值为 0,方差为 t − s。也就是说,增量 X_t − X_s 服从正态分布,其方差由时间增量的长度决定。

We can think of a Wiener process or a Brownian motion as a random walk with infinitesimally small steps, so discrete movements become infinitesimally small fluctuations, and we get a continuous random walk. A Brownian motion is a continuous time stochastic process {X_t ; t ≥ 0}. The random variables X_t are real valued, have independent increments, and the difference between X_t and X_s at two separate times t and s is normally distributed (follows a Gaussian bell-shape distribution) with mean 0 and variance t – s. That is, the increments X_t − X_s are normally distributed, with variance given by the length of the time increment.
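The "random walk with infinitesimally small steps" picture translates directly into a simulation: chop [0, T] into n steps of length dt and accumulate independent N(0, dt) increments. A hedged sketch (names are ours):

```python
import math
import random

def brownian_path(T=1.0, n=1000, seed=None):
    """Discrete approximation of a Brownian path on [0, T]:
    accumulate n independent Gaussian increments with variance dt."""
    rng = random.Random(seed)
    dt = T / n
    path = [0.0]  # Brownian motion starts at 0
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

path = brownian_path(seed=0)
```

As n grows, the piecewise-linear path looks increasingly like the spiky, nowhere-differentiable sample paths described above.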

实值连续时间随机过程的有趣之处在于它可以沿连续路径移动,从而产生有趣的时间随机函数。例如,几乎可以肯定的是,布朗运动(维纳过程)的样本路径在任何地方都是连续的,但在任何地方都不可微(尖峰太多)。

The interesting thing about a real valued continuous time stochastic process is that it can move in continuous paths, giving rise to interesting random functions of time. For example, almost surely, a sample path of a Brownian motion (Wiener process) is continuous everywhere, but nowhere differentiable (too many spikes).

布朗运动是随机过程研究的基础。它是随机微积分的起点,位于几个重要过程类别的交叉点:它是高斯马尔可夫过程、Levy 过程(具有平稳独立增量的过程)和鞅(接下来讨论)。

Brownian motion is fundamental in the study of stochastic processes. It is the starting point of stochastic calculus, lying at the intersection of several important classes of processes: it is a Gaussian Markov process, a Levy process (a process with stationary independent increments), and a martingale, discussed next.

Martingale

离散时间鞅是一个随机过程 X_0, X_1, X_2, ...,其中对于任意离散时间 t:

A discrete time martingale is a stochastic process X 0 , X 1 , X 2 , . . . where for any discrete time t:

𝔼[X_{t+1} | X_1, X_2, ..., X_t] = X_t

也就是说,考虑到所有先前的观测值,下一个观测值的预期值等于最近的观测值。这是一种奇怪的定义方式(遗憾的是,这在这个领域很常见),但让我们给出一些鞅出现的上下文的简短示例:

That is, the expected value of the next observation, given all the previous observations, is equal to the most recent observation. This is a weird way of defining something (sadly, this is very common in this field), but let’s give a few brief examples of some contexts within which martingales appear:

  • 无偏随机游走是鞅的一个例子。

  • An unbiased random walk is an example of a martingale.

  • 如果赌徒玩的所有投注游戏都是公平的,那么赌徒的财富就是鞅。假设如果一枚公平的硬币正面朝上,一个赌徒赢 1 美元,如果反面朝上,他输 1 美元。如果 X n 是赌徒在n次抛硬币后的财富,那么在给定历史的情况下,赌徒在下一次抛硬币后的条件预期财富等于他们当前的财富。

  • A gambler’s fortune is a martingale if all the betting games that the gambler plays are fair. Suppose a gambler wins $1 if a fair coin comes up heads, and loses $1 if it comes up tails. If X n is the gambler’s fortune after n tosses, then the gambler’s conditional expected fortune after the next coin toss, given the history, is equal to their present fortune.

  • 在生态群落中,一群物种争夺资源,我们可以将任何特定物种的个体数量建模为随机过程。该序列是生物多样性和生物地理学统一中性理论下的鞅。

  • In an ecological community, where a group of species compete for resources, we can model the number of individuals of any particular species as a stochastic process. This sequence is a martingale under the unified neutral theory of biodiversity and biogeography.
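The fair-coin gambler above can be checked numerically: conditional on the current fortune, the average fortune after one more toss stays the same. A small sketch (names are illustrative):

```python
import random

rng = random.Random(123)

def toss(fortune):
    """One round of the fair game: win $1 on heads, lose $1 on tails."""
    return fortune + (1 if rng.random() < 0.5 else -1)

current = 5           # the gambler's fortune after some history
trials = 100_000
# Average fortune after one more toss, conditional on the current fortune;
# the martingale property says this should stay near `current`.
avg_next = sum(toss(current) for _ in range(trials)) / trials
```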

讨论鞅时会出现停止时间的概念。这是一个有趣的概念,它刻画了这样的想法:在任何特定时刻 t,你可以查看到目前为止的序列,并判断是否到了该停止的时候。关于随机过程 X_1, X_2, X_3, ... 的停止时间是一个随机变量 S(S 取自 stop),其性质是:对于每个 t,事件 S = t 是否发生仅取决于 X_1, X_2, X_3, ..., X_t 的取值。例如,停止时间随机变量可以对赌徒选择停手离开赌桌的时刻建模:这取决于他们此前的输赢,但不取决于尚未进行的赌局的结果。

Stopping times appear when discussing martingales. This is an interesting concept, capturing the idea that: at any particular time t, you can look at the sequence so far and tell if it is time to stop. A stopping time with respect to a stochastic process X 1 , X 2 , X 3 , . . . is a random variable S (for stop) with the property that for each t, the occurrence or nonoccurrence of the event S = t depends only on the values of X 1 , X 2 , X 3 , . . . , X t . For example, the stopping time random variable models the time at which a gambler chooses to stop and leave a gambling table. This will depend on their previous winnings and losses, but not on the outcomes of the games that they haven’t yet played.

Levy 过程

Levy Process

我们已经提到,泊松过程和布朗运动(维纳过程)是 Levy 过程最常见的两个例子。Levy 过程是具有独立平稳增量的随机过程。它可以建模位移依次随机的粒子的运动:两两不相交的时间区间内的位移相互独立,而长度相同的不同时间区间内的位移具有相同的概率分布。在这个意义上,它是随机游走的连续时间类比。

We have mentioned the Poisson process and Brownian motion (Wiener process) as two of the most popular examples of a Levy process. This is a stochastic process with independent, stationary increments. It can model the motion of a particle whose successive displacements are random, in which displacements in pair-wise, disjoint time intervals are independent, and displacements in different time intervals of the same length have identical probability distributions. In this sense, it is the continuous time analog of a random walk.

分支过程

Branching Process

分支过程随机分裂成分支。例如,它可以建模某个种群的演化(如细菌,或核反应堆中的中子):给定一代中的每个个体,按照某个不随个体变化的固定概率分布,在下一代中产生随机数量的个体。分支过程理论的主要问题之一是最终灭绝的概率,即种群在有限代之后灭绝的概率。

A branching process randomly splits into branches. For example, it models a certain population’s evolution (like bacteria, or neutrons in a nuclear reactor) where each individual in a given generation produces a random number of individuals in the next generation, according to some fixed probability distribution that does not vary from individual to individual. One of the main questions in the theory of branching processes is the probability of ultimate extinction, where the population dies out after a finite number of generations.

马尔可夫链

Markov Chain

让我们正式定义离散时间马尔可夫链,因为它是最重要的随机过程之一,并且因为它出现在人工智能强化学习的背景下。为了定义马尔可夫链,我们需要:

Let’s formally define a discrete time Markov chain, since it is one of the most important stochastic processes, and because it comes up in the context of reinforcement learning in AI. To define a Markov chain, we need:

  • 一组离散的可能状态S(有限或无限)。将其视为粒子或代理可以占据的状态集。在每一步,马尔可夫过程随机地从一种状态演变到另一种状态。

  • A discrete set of possible states S (finite or infinite). Think of this as the set of states that a particle or an agent can occupy. At each step, the Markov process randomly evolves from one state to another.

  • 初始分布,规定每个可能状态 i 的概率 ν_i。初始时,粒子位于某个位置、或智能体处于某个状态的可能性有多大?

  • An initial distribution prescribing the probability ν_i of each possible state i. Initially, how likely is it for a particle to be at a certain location or for an agent to be in a certain state upon initialization?

  • 转移概率 p_ij,指定粒子或智能体从状态 i 转移到状态 j 的概率。注意,对于每个状态 i,有 p_i1 + p_i2 + ... + p_in = 1。而且,这个过程没有记忆:转移概率只取决于状态 i 和状态 j,而不取决于之前访问过的状态。

  • Transition probabilities p_ij specifying the probability that the particle or the agent transitions from state i to state j. Note that we have, for each state i, the sum p_i1 + p_i2 + ... + p_in = 1. Moreover, this process has no memory, because this transition probability depends only on state i and state j, not on previously visited states.

现在,马尔可夫链是一个在 S 中取值的随机过程 X_0, X_1, ...,使得:

Now, a Markov chain is a stochastic process X_0, X_1, ... taking values in S such that:

P(X_0 = state_0, X_1 = state_1, ..., X_n = state_n) = ν_0 p_{0,1} p_{1,2} ... p_{n-1,n}

我们可以把所有转移概率汇集到一个方阵中(马尔可夫矩阵,每行的非负元素之和为 1),并用该矩阵做乘法。这个矩阵汇总了一步之内从任意状态 i 转移到状态 j 的概率。很酷的一点是,马尔可夫矩阵的幂也是马尔可夫矩阵,因此,这个矩阵的平方就汇总了两步之内从任意状态 i 转移到状态 j 的概率,依此类推。

We can bundle up all the transition probabilities in a square matrix (a Markov matrix, with each row having nonnegative numbers adding up to 1), and multiply by that matrix. This matrix summarizes the probabilities of transitioning from any state i to state j in one step. The cool thing is that the powers of a Markov matrix are also Markov, so for example, the square of this matrix summarizes the probabilities of transitioning from any state i to state j in two steps, and so on.
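A tiny sketch makes the "powers of a Markov matrix are Markov" claim checkable. The two-state matrix below is a made-up example; squaring it gives the two-step transition probabilities, and each row of the square still sums to 1:

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A 2-state Markov matrix: entry [i][j] is the one-step probability
# of moving from state i to state j; each row sums to 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = mat_mul(P, P)  # two-step transition probabilities
row_sums = [sum(row) for row in P2]
```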

与马尔可夫链相关的一些基本概念是瞬态性、常返性和不可约性。如果从状态 i 出发,最终必然会回到它,则称该状态是常返的;如果不是常返的,则称为瞬态的。如果马尔可夫链可以从任何状态移动到任何其他状态,则该链是不可约的。

Some fundamental notions related to Markov chains are transience, recurrence, and irreducibility. A state i is recurrent if, starting from it, we will certainly eventually return to it. It is called transient if it is not recurrent. A Markov chain is irreducible if it is possible for the chain to move from any state to any other state.

最后,平稳概率向量定义了可能状态上的一个概率分布,当我们将它乘以转移矩阵时保持不变。这直接与线性代数中对应特征值为 1 的特征向量联系起来。当我们所学的数学知识彼此串联时产生的愉悦感令人上瘾,也正是它让我们与这个领域保持着又爱又恨的关系。值得庆幸的是,我们待得越久,这段关系就越趋于纯粹的热爱。

Finally, a stationary probability vector, defining a probability distribution over the possible states, is one that does not change when we multiply it by the transition matrix. This ties straight into eigenvectors in linear algebra with corresponding eigenvalue one. The feel good high that we get when the math we know connects together is an addictive feeling and is what keeps us trapped in a love-hate relationship with this field. Thankfully, the longer we stay, the more love-love it becomes.
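One way to find such a stationary vector is power iteration: repeatedly multiply a probability row vector by the transition matrix until it stops changing. A sketch on a made-up two-state chain:

```python
# Power iteration toward the stationary distribution: a probability
# row vector pi satisfying pi = pi P (an eigenvector with eigenvalue 1).
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(pi):
    """One multiplication of the row vector pi by the matrix P."""
    return [sum(pi[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P))]

pi = [0.5, 0.5]
for _ in range(200):
    pi = step(pi)
# For this chain, pi converges to [0.8, 0.2].
```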

伊藤引理

Itô’s Lemma

让我们再把一些数学知识串联起来。随机过程对随时间演化的随机量建模。随机过程的函数也随时间随机演化。当确定性函数随时间演化时,接下来的问题通常是:演化得有多快?为了回答这个问题,我们取它关于时间的导数,并围绕确定性函数的导数(和积分)建立微积分。链式法则至关重要,尤其是对于训练机器学习模型而言。

Let’s tie a bit more math together. A stochastic process models a random quantity that evolves with time. A function of a stochastic process also evolves randomly with time. When deterministic functions evolve with time, the next question is usually, how fast? To answer that, we take its derivative with respect to time, and develop calculus around the derivatives (and integrals) of deterministic functions. The chain rule is of paramount importance, especially for training machine learning models.

伊藤引理类似于随机过程函数的链式法则,是链式法则在随机微积分中的对应物。我们用它来求随机过程的时间相关函数的微分。

Itô’s lemma is the analog of the chain rule for functions of stochastic processes. It is the stochastic calculus counterpart of the chain rule. We use it to find the differential of a time-dependent function of a stochastic process.

马尔可夫决策过程和强化学习

Markov Decision Processes and Reinforcement Learning

在人工智能社区中,马尔可夫决策过程与以下方面相关:

In the AI community, Markov decision processes are associated with:

动态规划和理查德·贝尔曼
Dynamic programming and Richard Bellman

贝尔曼在该领域发挥了巨大的作用,他的最优条件在许多算法中得到了实现。

Bellman played a monumental role in the field, and his optimality condition is implemented in many algorithms.

强化学习
Reinforcement learning

通过一系列与积极或消极奖励(试验和错误)相关的行动来寻找最佳策略。代理可以在多个操作和转换状态之间进行选择,其中转换概率取决于所选操作。

Finding an optimal strategy via a sequence of actions that are associated with positive or negative rewards (trials and errors). The agent has choices between several actions and transition states, where the transition probabilities depend on the chosen actions.

深度强化学习
Deep reinforcement learning

将强化学习与神经网络相结合。在这里,神经网络将观察结果作为输入,并输出代理可以采取的每个可能动作的概率(概率分布)。然后,代理根据估计的概率随机决定下一步行动。例如,如果智能体有两个选择,左转或右转,神经网络输出左转为 0.7,则智能体将以 70% 的概率左转,以 30% 的概率右转。

Combining reinforcement learning with neural networks. Here, a neural network takes the observations as input, and outputs a probability for each possible action that the agent can take (a probability distribution). The agent then decides on the next action randomly, according to the estimated probabilities. For example, if the agent has two choices, turn left or turn right, and the neural network outputs 0.7 for turn left, then the agent will turn left with 70% probability, and turn right with 30% probability.
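The turn-left/turn-right sampling step is easy to sketch without any neural network: treat the network's output as a fixed distribution and draw from it (all names below are illustrative).

```python
import random

rng = random.Random(0)
actions = ["left", "right"]
probs = [0.7, 0.3]  # stand-in for a network's output distribution

def next_action():
    """Sample the agent's next action according to the probabilities."""
    return rng.choices(actions, weights=probs)[0]

draws = [next_action() for _ in range(10_000)]
left_freq = draws.count("left") / len(draws)  # should be near 0.7
```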

强化学习具有推动我们迈向通用智能的巨大潜力。当行动的回报不是立即的而是一系列连续采取的行动的结果时,智能代理需要做出理性的决策。这就是不确定性下推理的缩影。

Reinforcement learning has great potential to advance us toward general intelligence. Intelligent agents need to make rational decisions when payoffs from actions are not immediate but instead result from a series of actions taken sequentially. This is the epitome of reasoning under uncertainty.

强化学习的例子

Examples of Reinforcement Learning

强化学习的应用例子有很多:自动驾驶汽车、推荐系统、家用恒温器(接近目标温度并节省能源时获得正奖励,需要人工调节温度时获得负奖励),以及股票市场的自动投资(输入是股票价格,输出是每只股票的买卖数量,奖励是货币收益或损失)。

Examples of reinforcement learning are plentiful: self-driving cars, recommender systems, thermostat at home (getting positive rewards whenever it is close to the target temperature and saves energy, and negative rewards when humans need to tweak the temperature), and automatic investing in the stock market (the input is the stock prices, the output is how much of each stock to buy or sell, the rewards are the monetary gains or losses).

也许深度强化学习成功的最著名例子是DeepMind 的 AlphaGo,这款人工智能代理曾于 2016 年在中国古代围棋游戏中击败了世界上最好的人类棋手。从棋盘游戏(例如国际象棋或围棋)角度思考强化学习是很直观的,因为在每一步中,我们都决定必须采取的行动顺序,并且非常清楚我们当前的决定会影响游戏的整个结果。我们最好在每一步中采取最佳行动。此外,在每一步中,我们的最优策略都会演变,因为它还取决于对手的行动(他们正在解决完全相同的问题,但从他们的有利角度)。

Perhaps the most famous example for deep reinforcement learning success is DeepMind’s AlphaGo, the AI agent that in 2016 beat the world’s best human player in the ancient Chinese game of Go. Thinking of reinforcement learning in terms of board games, such as chess or Go, is intuitive, because at each step we decide on the sequence of actions that we must take, knowing very well that our current decision affects the whole outcome of the game. We better act optimally at each step. Moreover, at each step, our optimal strategy evolves, because it also depends on the actions of our opponent (who is solving the exact same problem but from their vantage point).

我对涉及游戏的例子有点排斥,因为如今我女儿沉迷于 PlayStation 5。我更喜欢投资市场的例子。我们的财务顾问在每天都在变化的市场中工作,需要在每个时间步做出买入/卖出某些股票的决策,其长期目标是利润最大化、损失最小化。市场环境是随机的,我们并不知道它的规则,但出于建模目的,我们假设自己知道。现在,让我们把人类财务顾问换成人工智能代理,看看在不断变化的市场环境中,这个代理在每个时间步需要解决什么样的优化问题。

I am a bit biased against examples containing games, since nowadays my daughter is addicted to PlayStation 5. I prefer the investment market example. Our financial adviser operates in a market that changes every day, and needs to make decisions on buying/selling certain stocks at each time step, with the long-term goal of maximizing profit and minimizing losses. The market environment is stochastic and we do not know its rules, but we assume that we do for our modeling purposes. Now let’s switch our human financial adviser to an AI agent, and let’s see what kind of optimization problem this agent needs to solve at each time step in the constantly changing market environment.

强化学习作为马尔可夫决策过程

Reinforcement Learning as a Markov Decision Process

让我们用数学方法将强化学习表述为马尔可夫决策过程。我们的智能体所处的环境是概率性的,由状态和这些状态之间的转移概率组成。这些转移概率取决于所选择的动作。因此,所得到的马尔可夫过程在编码从任一状态 s 到另一状态 s′ 的转移时,显式依赖于处于状态 s 时所采取的动作 a。

Let’s formulate reinforcement learning mathematically as a Markov decision process. The environment within which our agent exists is probabilistic, consisting of states and transition probabilities between these states. These transition probabilities depend on the chosen actions. Thus, the resulting Markov process that encodes the transitions from any state s to another state s′ has an explicit dependency on the action a taken while in state s.

这里的主要假设是我们知道这个过程,也就是说我们知道环境的规则。换句话说,对于每个状态 s、状态 s′ 和动作 a,我们都知道以下概率:

The main assumption here is that we know this process, which means that we know the rules of the environment. In other words, we know the following probability for each state s, state s′, and action a:

P(next state = s′ | current state = s, action taken = a)

我们还知道奖励制度,即:

We also know the reward system, which is:

P(next reward = value | current state = s, action taken = a, next state = s′)

现在这个讨论属于动态规划的范畴:我们寻找能够带来最优价值(最大奖励或最小损失)的最优策略(一系列好的动作)。这个优化问题比我们迄今遇到的问题要复杂一些,因为通向最优价值的是一个动作序列。因此,我们必须把问题分成若干步,在每一步寻找为未来多步之后的最优奖励服务的动作。贝尔曼最优方程解决的正是这个问题:在假设我们知道每一步要优化什么的前提下,把它简化为只在当前状态搜索一个最优动作(而不是一次搜索全部动作)。贝尔曼的巨大贡献在于以下断言:当前状态的最优价值,等于采取一个最优动作后的平均奖励,加上该动作可能到达的所有下一状态的期望最优价值。

This discussion now belongs in the realm of dynamic programming: we search for the optimal policy (sequence of good actions) leading to optimal value (maximal reward or minimal loss). This optimization problem is a bit more involved than the ones we have encountered so far, because it is a sequence of actions that leads to the optimal value. Therefore, we must divide the problem into steps, looking for the action at each step that shoots for the optimal reward multiple steps ahead in the future. Bellman’s optimality equation solves exactly this problem, simplifying it into the search for only one optimal action at the current state (as opposed to searching for all of them at once), given that we know what problem to optimize at each step. Bellman’s huge contribution is the following assertion: the optimal value of the current state is equal to the average reward after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.

代理通过迭代过程与其环境交互。它从初始状态和该状态的一组可能的操作(在给定该状态下采取操作的概率分布)开始,然后迭代计算以下内容:

The agent interacts with its environment via an iterative process. It starts with an initial state and a set of that state’s possible actions (the probability distribution for taking an action given that state), then computes the following, iteratively:

  1. 要采取的下一个最佳操作(将其转换到具有一组新的可能操作的新状态)。这称为策略迭代,优化目标是最大化未来奖励。

  1. The next optimal action to take (which transitions it to a new state with a new set of possible actions). This is called the policy iteration, and the optimization goal is to maximize future reward.

  2. 给定最佳行动的预期价值(奖励或损失)。这称为值迭代。

  2. The expected value (rewards or losses) given that optimal action. This is called the value iteration.

价值函数将智能体在给定其当前状态和之后采取的最佳行动序列的情况下的预期未来奖励相加:

The value function adds up the agent’s expected future rewards given its current state and the optimal sequence of actions taken afterward:

Value(state, optimal sequence of actions) = 𝔼[ Σ_k γ^k reward_k ]

折扣系数 γ 是 0 到 1 之间的数字,用来鼓励采取能尽早(而不是较晚)带来正奖励的动作。把这个系数放进优化问题中,会随时间调整奖励的重要性,使未来的奖励权重更小(如果 γ 介于 0 和 1 之间,那么当 k 很大时 γ^k 很小)。

The discount factor γ is a number between 0 and 1. It is useful to encourage taking actions that result in sooner rather than later positive rewards. Putting this factor in the optimization problem adjusts the importance of rewards over time, giving less weight to future rewards (if γ is between 0 and 1, then γ^k is small for large k).

让我们明确价值函数中的优化(我们选择在给定当前状态下最大化奖励的动作序列):

Let’s make the optimization in the value function explicit (we are choosing the sequence of actions that maximizes the rewards given the current state):

Value(s) = max over actions 𝔼[ Σ_{k=0}^{∞} γ^k reward_k | state_0 = s ]

现在我们将其分解,以确保代理当前的奖励是明确的,并且与未来的奖励分开:

Now we break this up to make sure the agent’s current reward is explicit and separate from its future rewards:

Value(s) = max over actions 𝔼[ reward_0 + Σ_{k=1}^{∞} γ^k reward_k | state_1 = s′ ]

最后,我们发现智能体当前状态的价值函数取决于其当前奖励和未来状态的贴现价值函数:

Finally, we find that the value function at the agent’s current state depends on its current reward and a discounted value function at its future states:

Value(s) = max over actions 𝔼[ reward_0 + γ Value(s′) ]

该命题允许我们(在时间上向后)迭代地求解主要优化问题。智能体现在要做的只是选择能到达下一个最佳状态的动作。价值函数的这个表达式就是强大的贝尔曼方程或贝尔曼最优性条件,它把原始优化问题分解为一系列简单得多的递归优化子问题:在每个状态进行局部优化(求出 Value(s′)),再把结果代入下一个优化子问题(求出 Value(s))。奇妙之处在于,以这种方式从期望的最终奖励向后推到当前应采取的动作,就能得到整体最优策略以及每个状态的最优价值函数。

That statement allows us to solve our main optimization problem iteratively (backward in time). All the agent has to do now is to choose the action to get to the next best state. This expression for the value function is the powerful Bellman’s equation or Bellman’s optimality condition, which breaks up the original optimization problem into a recursive sequence of much simpler optimization problems, optimizing locally at each state (finding Value(s′)), then putting the result into the next optimization subproblem (finding Value(s)). The miracle is that working backward this way from the desired ultimate reward to deciding what action to take now gives us the overall optimal strategy along with the optimal value function at each state.
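The backward, recursive structure of Bellman's equation is exactly what value iteration implements. Below is a sketch on a tiny, made-up 2-state MDP (all transition probabilities, rewards, and names are illustrative, not from the book): each sweep applies the optimality condition V(s) = max_a [ R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′) ] until the values stop changing.

```python
# Value iteration on a tiny, hypothetical 2-state MDP.
# P[s][a][s2] is the transition probability to state s2,
# and R[s][a] the expected immediate reward.
P = {0: {"stay": [1.0, 0.0], "go": [0.2, 0.8]},
     1: {"stay": [0.0, 1.0], "go": [0.7, 0.3]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
gamma = 0.9  # discount factor

def bellman_update(V):
    """One sweep of Bellman's optimality condition:
    V(s) = max_a [ R(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2) ]."""
    return [max(R[s][a] + gamma * sum(p * V[s2]
                                      for s2, p in enumerate(P[s][a]))
                for a in P[s])
            for s in range(2)]

V = [0.0, 0.0]
for _ in range(500):       # the update is a contraction, so this converges
    V = bellman_update(V)
```

Because the update is a γ-contraction, the iterates converge to the unique fixed point, which here can be solved by hand: staying in state 1 forever gives V(1) = 2/(1 − 0.9) = 20.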

最优控制和非线性动力学背景下的强化学习

Reinforcement Learning in the Context of Optimal Control and Nonlinear Dynamics

在关于偏微分方程的第 13 章中,我们在非线性动力学、最优控制和 Hamilton-Jacobi-Bellman 偏微分方程的背景下重新审视强化学习。与我们的代理在前面的讨论中交互的概率马尔可夫环境不同,强化学习的动态规划方法(导致 Hamilton-Jacobi-Bellman 偏微分方程)是确定性的。

In Chapter 13 on PDEs, we revisit reinforcement learning in the context of nonlinear dynamics, optimal control, and the Hamilton-Jacobi-Bellman partial differential equation. Unlike the probabilistic Markov environment that our agent interacted with in the previous discussion, the dynamic programming approach to reinforcement learning (which leads to the Hamilton-Jacobi-Bellman partial differential equation) is deterministic.

用于强化学习的 Python 库

Python Library for Reinforcement Learning

最后,TF-Agents 库(由 Google 开发,2018 年开源)是实现强化学习算法的一个有用工具,它是基于 TensorFlow(Python)的强化学习库。

Finally, a helpful library for implementing reinforcement learning algorithms is the TF-Agents library (by Google, open sourced in 2018), a reinforcement learning library based on TensorFlow (Python).

理论和严谨的基础

Theoretical and Rigorous Grounds

严格的(即数学上精确的)概率论需要测度论。但为什么呢?你可能会理直气壮地抗议。毕竟,我们已经成功回避它很久了。

Rigorous, or mathematically precise, probability theory needs measure theory. But why? you might rightfully protest. After all, we have managed to avoid this for the longest time.

因为我们已经无法再回避它了。

Because we cannot avoid it any longer.

让我们写下这一点,但永远不要大声承认:正是测度论让许多学生放弃了继续学习数学,主要原因是它的故事从未按时间顺序讲清楚它如何以及为何出现。此外,测度论概率中的大量工作都在证明某个随机变量存在(存在于某个样本空间、作为事件空间的西格玛代数,以及该西格玛代数中每个事件或集合的测度之上),仿佛把随机变量写下来并用它对各种随机实体建模还不足以算作存在。这想必就是数学家和哲学家相处得如此融洽的原因。

Let’s write this but never admit to it out loud: it is measure theory that turns off many students from pursuing further studies in math, mostly, because its story is never told in chronological order about how and why it came to be. Moreover, a lot of work in measure theoretic probability has to do with proving that a certain random variable exists (over some sample space, an event space or a sigma algebra, and a measure for each event or set in that sigma algebra), as if writing the random variable down and using it to model all sorts of random entities is not enough existence. This must be the reason why mathematicians and philosophers get along so well.

我们已经以闪电般的速度浏览了本章中的许多概念(我的学生一直指责我这一点),但我们需要重新开始并给出:

We have already flown through many concepts in this chapter at lightning speed (my students accuse me of this all the time), but we need to start over and give:

  • 对概率和西格玛代数的精确数学理解

  • A precise mathematical understanding of probabilities and sigma algebras

  • 随机变量和概率分布的精确数学定义

  • A precise mathematical definition of a random variable and a probability distribution

  • 随机变量期望值的精确数学定义及其与积分的联系

  • A precise mathematical definition of an expected value of a random variable, and its connection to integration

  • 概率不等式概述(控制不确定性)

  • An overview of probability inequalities (controlling uncertainty)

  • 大数定律、中心极限定理和其他收敛定理概述

  • An overview of the law of large numbers, the central limit theorem, and other convergence theorems

好吧,这目标太宏大了。我们不可能在一章的一节里讲授完整的严格概率论课程。相反,我们要做的是为它提出一个令人信服的理由,并带着对基本思想的良好理解离开。

Alright, that is too ambitious. We cannot give a full course on rigorous probability theory in one section of one chapter. What we will do instead is make a convincing case for it, and leave with a decent understanding of the fundamental ideas.

我们从非严格概率的两个主要局限开始(且不说它的每个数学对象都有无数不一致的名称、符号和模糊的定义)。

We start with two major limitations of nonrigorous probability (other than the fact that each of its math objects has countless inconsistent names, notations, and fuzzy definitions).

哪些事件有概率发生?

Which Events Have a Probability?

给定一个样本空间(我们可以从中随机采样的集合),是否任何子集都可以定义概率?如果我们从实数线上均匀采样并问:我们选到有理数的概率是多少?选到代数数(整系数多项式方程的解)的概率是多少?或者选到实数线上其他某个复杂子集的成员的概率是多少?

Given a sample space (a set that we can sample randomly from), can any subset have a probability defined on it? What if we are sampling numbers uniformly from the real line and ask, what is the probability that we pick a rational number? An algebraic number (the solution to some polynomial equation with integer coefficients)? Or a member from some other complicated subset of the real line?

看看这些问题如何慢慢地把我们带入实线上集合论的细节,这反过来又把我们直接带入测度论:这个理论解决了我们可以测量实线的哪些子集,以及不能测量哪些子集

See how these questions are slowly drawing us into the details of set theory on the real line, which in turn pulls us straight into measure theory: the theory that addresses which subsets of the real line we can measure, and which subsets we cannot.

为样本空间的子集定义概率,听起来很像为该集合定义测度,而且似乎只有可测子集才能为其定义概率。那么样本空间中其他不可测的子集呢?对它们来说很遗憾,我们无法为它们定义概率。重申一下,Prob(A) 并非对样本空间的每个子集 A 都有意义;它只对该空间的可测子集有意义。因此,我们必须把所有可测子集归拢到一起,放弃其余的子集,不再考虑它们及其数学,然后就可以放心了,因为这样我们就能在一个所有被纳入的事件(子集)都定义了概率(测度)的领域中工作。我们使用的概率测度满足合理的性质:它是 [0,1] 中的非负数,且互补事件(子集)的概率相加为 1。这一整套迂回揭示了实数线及其子集的复杂性,更一般地说,揭示了连续统的奥秘和无限的奇妙。

Defining a probability for a subset of a sample space is starting to sound a lot like defining a measure of that set, and it seems like only subsets that are measurable can have a probability defined for them. How about the other nonmeasurable subsets of the sample space? Too bad for them, we cannot define a probability for them. To reiterate, Prob(A) does not make sense for every subset A of a sample space; instead, it only makes sense for measurable subsets of that space. So we must harness all the measurable subsets together, abandon the rest and never think about them or their mathematics, and relax, because then we can act in a realm where all the events (subsets) that we harnessed have probabilities (measures) defined for them. The probability measure that we work with satisfies reasonable properties, in the sense that it is a nonnegative number in [0,1], and probabilities of complementary events (subsets) add up to 1. This whole roundabout reveals the intricacies of the real line and its subsets, and more generally, the mysteries of the continuum and the wonders of the infinite.

严格的概率论帮助我们领会离散空间和连续空间的性质,这些性质在非常简单的例子中就会显现,比如在离散集合上构造离散均匀分布,与在给定区间上构造连续均匀分布的对比。

Rigorous probability theory helps us appreciate the properties of both discrete and continuous spaces, revealed in examples as simple as constructing the discrete uniform distribution on a discrete set versus constructing the continuous uniform distribution on a given interval.

我们可以讨论更广泛的随机变量吗?

Can We Talk About a Wider Range of Random Variables?

非严格概率(即如上所述回避测度论的概率)的另一个局限,是对它所允许的随机变量种类的限制。特别是,我们究竟在哪里划定离散随机变量与连续随机变量之间的界线?真的存在这样一条界线吗?兼具离散和连续两方面的随机变量又算什么?举个简单的例子,假设某个随机变量的取值由抛硬币决定:硬币正面朝上时它服从泊松分布(离散),反面朝上时服从正态分布(连续)。按照我们对这两种类型的非严格理解,这个新随机变量既不完全离散,也不完全连续。那它到底是什么?严格的回答是:一旦我们定义了任何随机变量赖以成立的基础,离散与连续随机变量之间当然就没有本质区别了。它必须立足的严格基础是:哪个集合构成样本空间?该样本空间的哪些子集可测?概率测度是什么?随机变量的分布是什么?这是任何随机变量的共同基础或出发点。一旦指定了这个基础,离散、连续或介于两者之间的任何情形都变成一个小细节,简单到只需回答:我们在与什么集合(或集合的乘积)打交道?

The other limitation of nonrigorous probability, meaning the one that avoids measure theory as we just described it, is the restriction on the kinds of random variables that it allows. In particular, where exactly do we draw the line between a discrete and continuous random variable? Is there really such a line? How about random variables that have both discrete and continuous aspects? As a simple example, suppose a random variable’s value is decided by a flip of a coin. It is Poisson distributed (discrete) if the coin comes up heads, and normally distributed (continuous) if the coin comes up tails. This new random variable is neither fully discrete nor fully continuous in the nonrigorous sense that we understand either type. Then what is it? The rigorous answer is this: of course there is no distinction between discrete and continuous random variables once we define the grounds that any random variable stands on. Here’s the rigorous ground it must stand on. What set formulates the sample space? What subsets of this sample space are measurable? What is the probability measure? What is the distribution of the random variable? This is the common ground, or the starting point for any random variable. Once we specify this ground, then discrete, continuous, or anything in between becomes a small detail, as simple as answering: what set (or, say, product of sets) are we working with?
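The coin-flip random variable above is easy to sample in code. The sketch below (helper names are ours; the Poisson draw uses plain CDF inversion) mixes an integer-valued Poisson branch with a real-valued normal branch, so the result is neither purely discrete nor purely continuous:

```python
import math
import random

rng = random.Random(42)

def poisson_draw(lam):
    """Sample a Poisson(lam) value by inverting its CDF."""
    u, n, pmf, cdf = rng.random(), 0, math.exp(-lam), 0.0
    while cdf + pmf < u:
        cdf += pmf
        n += 1
        pmf *= lam / n
    return n

def mixed_sample(lam=3.0, mu=0.0, sigma=1.0):
    """The coin-flip random variable from the text: Poisson(lam)
    on heads (discrete), Normal(mu, sigma) on tails (continuous)."""
    if rng.random() < 0.5:
        return poisson_draw(lam)   # integer-valued branch
    return rng.gauss(mu, sigma)    # real-valued branch

samples = [mixed_sample() for _ in range(2000)]
discrete = [x for x in samples if isinstance(x, int)]
```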

概率三元组(样本空间、西格玛代数、概率测度)

A Probability Triple (Sample Space, Sigma Algebra, Probability Measure)

一切都从概率三元组开始(其实并非如此,但严格性从这里开始)。我们称之为概率测度空间,并约定整个样本空间的测度等于 1,即样本空间的概率为 1。我们现在觉得自己相当高级了,可以把“概率”和“测度”这两个词互换使用。“测度”这个词带来的安慰在于,它把我们带回了确定性的领域:采样是随机的,但我们可以测量任何(可测的)事件发生的可能性。

It all starts with a probability triple (not really, but this is where rigor starts). We call this a probability measure space, with the understanding that the measure of the whole sample space is equal to one. That is, the probability of the sample space is one. We are now feeling very advanced, using the words probability and measure interchangeably. The comfort that the word measure provides is that it brings us back to a deterministic realm. The sampling is random, but we can measure the likelihood of any occurrence (that is measurable).

The three objects making up a probability measure space are:

The sample space

The arbitrary nonempty set that we randomly pull samples from.

The sigma algebra

A set of subsets of the sample space that represent the allowed events (the events whose probability we are allowed to talk about, because they are the only ones we are able to measure). A sigma algebra must contain the whole sample space, be closed under complements (meaning if a set is in the sigma algebra then so is its complement), and be closed under countable unions (meaning the union of countably many subsets of the sigma algebra is also a member of the sigma algebra). The corollary from the previous two properties and De Morgan’s laws (which have to do with complements of unions and intersections) is that the sigma algebra is also closed under countable intersections.
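On a finite sample space, these closure properties are easy to check mechanically. The following is a minimal sketch (the names and the toy sets are illustrative, not from the book) that tests whether a family of subsets is a sigma algebra; on a finite space, closure under countable unions reduces to closure under pairwise unions:

```python
# A toy sample space and a candidate sigma algebra on it
# (generated by the partition {{1, 2}, {3, 4}}).
omega = frozenset({1, 2, 3, 4})
sigma = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}

def is_sigma_algebra(sample_space, family):
    """Check the defining properties on a finite family: contains the
    sample space, closed under complements, closed under unions."""
    if sample_space not in family:
        return False
    for a in family:
        if sample_space - a not in family:   # closure under complements
            return False
        for b in family:
            if a | b not in family:          # closure under unions
                return False
    return True

print(is_sigma_algebra(omega, sigma))                    # True
print(is_sigma_algebra(omega, {frozenset({1}), omega}))  # False: missing a complement
```

By De Morgan’s laws, a family passing these checks is automatically closed under intersections as well.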

The probability measure

A number between zero and one (inclusive) associated with each subset of the sigma algebra, that satisfies the reasonable properties that we associate with nonrigorous probability:

  1. Prob(sample space) = 1

  2. Prob(countable union of pair-wise disjoint sets) = countable sum of probabilities of each set

This is very good, because as long as we are able to articulate the sample space set, the sigma algebra, and a function with the previously mentioned properties mapping every member of the sigma algebra to its measure (probability), then we can start building the theory on solid grounds, defining all kinds of random variables, their expectations, variances, conditional probabilities, sums and products, limits of sequences, stochastic processes, time derivatives of functions of stochastic processes (Itô’s calculus), and so on. We would not run into problems of what type of events have probabilities defined for them (all the members of the sigma algebra of the probability triple), or what type of random variables we can consider (any that we can rigorously define over a probability triple).
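As a concrete illustration (a hypothetical sketch, not from the book), here is a probability triple for a fair die: the sample space {1, …, 6}, the full power set as the sigma algebra, and a measure built from equal weights that satisfies the two properties above:

```python
from fractions import Fraction

# Sample space: a fair six-sided die; sigma algebra: the full power set.
omega = frozenset(range(1, 7))
weights = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """The probability measure: add the weights of the outcomes in the event."""
    return sum(weights[o] for o in event)

# Property 1: Prob(sample space) = 1.
assert prob(omega) == 1

# Property 2 (additivity): the probability of a union of disjoint
# events equals the sum of their probabilities.
evens, one = frozenset({2, 4, 6}), frozenset({1})
assert prob(evens | one) == prob(evens) + prob(one)

print(prob(evens))  # 1/2
```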

Where Is the Difficulty?

Note that the limitations of nonrigorous probability that we discussed both appear when we involve continuous variables, or when the sample space is in the continuum (uncountable). Had our world been only discrete, we wouldn’t be going through all this trouble. When we move to rigorous probability and attempt to construct probability triples for discrete sample spaces, we do not run into much trouble. The challenges appear in the continuum world, with uncountable sample spaces. Because suddenly we have to identify sigma algebras and associated probability measures on sets where the depth of the infinite continuum never ceases to fascinate. For example, this challenge appears even when we want to define a rigorous probability triple for the continuous uniform distribution on the interval [0,1].

The extension theorem runs to our aid and allows us to construct complicated probability triples. Instead of defining a probability measure over a massive sigma algebra, we construct it on a simpler set of subsets, a semialgebra, then the theorem allows us to automatically extend the measure to a full sigma algebra. This theorem allows us to construct Lebesgue measure on [0, 1] (which is exactly the continuous uniform distribution on [0,1]), product measures, the multidimensional Lebesgue measure, and finite and infinite coin tossing.

The worlds of set theory, real analysis, and probability have blended neatly together.

Random Variable, Expectation, and Integration

Now that we can associate a probability triple with a sample space, defining probabilities for a large collection of subsets of the sample space (all the members of the associated sigma algebra), we can rigorously define a random variable. As we know very well from nonrigorous probability, a random variable assigns a numerical value to each element of the sample space. So if we think of the sample space as all the possible random outcomes of some experiment (heads and tails of flipping a coin), then a random variable assigns a numerical value to each of these outcomes.

To build on rigorous grounds, we must define how a random variable Y interacts with the whole probability triple associated with the sample space. The short answer is: Y must be a measurable function from the sample space to the real line, in the sense that the set Y⁻¹((−∞, y]) is a member of the sigma algebra for every real y, which in turn means that this set has a probability measure. Note that Y maps from the sample space to the real line, and Y⁻¹ maps back from the real line to a subset of the sample space.

Just like a random variable from nonrigorous probability turns out to be a measurable function (with respect to a triple) in rigorous probability theory, the expectation 𝔼 ( Y ) of a random variable turns out to be the same as the integral of the random variable (measurable function) with respect to the probability measure. We write:

𝔼(Y) = ∫_Ω Y dP = ∫_Ω Y(ω) dP(ω)

Understanding the Integral Notation in the Expectation Formula

It is easy to understand the integral with respect to a probability measure, such as the one in the previously mentioned formula, if we think of the meaning of the expectation of a random variable in a discrete setting, as the sum of the value of the random variable times the probability of the set over which it assumes that value:

𝔼(Y) = Σ_{i=1}^{n} y_i P({ω ∈ Ω such that Y(ω) = y_i})

Now compare this discrete expression to the continuum integral in the expectation formula:

𝔼(Y) = ∫_Ω Y dP = ∫_Ω Y(ω) dP(ω)

We rigorously build up the integral (expectation) the exact same way we build up the Lebesgue integral in a first course on measure theory: first for simple random variables (which we can easily break up into a discrete sum; integrals start from sums), then for nonnegative random variables, and finally for general random variables. We can easily prove basic properties for integrals, such as linearity and order preservation. Note that whether the sample space is discrete, continuous, or anything complicated, as long as we have our probability triple to build on, the integral makes sense (in a much wider range of settings than we ever imagined for our basic calculus Riemann-style integration). Once we encounter Lebesgue-style integration, we sort of never look back.
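To make the parallel between the discrete sum and the integral concrete, here is a small sketch (illustrative names, not from the book) that computes the expectation of a simple random variable both ways: pointwise over the sample space, and as a sum over the values the variable takes:

```python
from fractions import Fraction

# Probability triple for one roll of a fair die.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

# A simple random variable: Y = 1 if the roll is even, else 0.
Y = lambda w: 1 if w % 2 == 0 else 0

# Integral of Y with respect to P, computed pointwise over the sample space...
integral = sum(Y(w) * P[w] for w in omega)

# ...and as the discrete sum over the values Y takes (a simple function).
values = {Y(w) for w in omega}
by_values = sum(y * sum(P[w] for w in omega if Y(w) == y) for y in values)

assert integral == by_values == Fraction(1, 2)
print(integral)  # 1/2
```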

Now that we have the expectation, we can define the variance and covariance exactly the same way as nonrigorous probability theory.

Then we can talk about independence, and important properties such that if X and Y are independent, then E(XY) = E(X)E(Y) and Var(X + Y) = Var(X) + Var(Y).
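A quick Monte Carlo sanity check of these two properties (an illustrative sketch; the sample size and tolerances are arbitrary choices, not from the book):

```python
import random

random.seed(0)
n = 200_000

# Independent draws: X uniform on {1,...,6}, Y standard uniform on [0, 1).
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.random() for _ in range(n)]

mean = lambda v: sum(v) / len(v)
var = lambda v: mean([x * x for x in v]) - mean(v) ** 2

# E(XY) should be close to E(X)E(Y) for independent X and Y.
e_xy = mean([x * y for x, y in zip(xs, ys)])
print(abs(e_xy - mean(xs) * mean(ys)) < 0.01)

# Var(X + Y) should be close to Var(X) + Var(Y).
sums = [x + y for x, y in zip(xs, ys)]
print(abs(var(sums) - (var(xs) + var(ys))) < 0.05)
```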

Distribution of a Random Variable and the Change of Variable Theorem

The distribution of a random variable X is a corresponding probability triple (ℝ, ℬ, μ) defined on the real line, such that for every subset B of the Borel sigma algebra ℬ defined on the real line, we have:

μ(B) = P(X ∈ B) = P(X⁻¹(B))

This is completely determined by the cumulative distribution function, F_X(x) = P(X ≤ x), of X.

Suppose we have a measurable real valued function f defined on the real line. Let X be a random variable on a probability triple (Ω, sigma algebra, P) with distribution μ. Note that for any real number x, f(x) is a real number, and for the random variable X, f(X) is a random variable.

The change of variable theorem says that the expected value of the random variable f(X) with respect to the probability measure P on a sample space Ω is equal to the expected value of the function f with respect to the measure μ on ℝ. Let’s write this first in terms of expectation and then in terms of integrals:

𝔼_P(f(X)) = 𝔼_μ(f)
∫_Ω f(X(ω)) dP(ω) = ∫_{−∞}^{+∞} f(t) μ(dt)

A nice thing that comes in handy from this change of variables theorem is that we can switch between expectations, integrations, and probabilities. Let f be the indicator function 1_B of a measurable subset B of ℝ (which is one over the subset and zero otherwise), then the formula gives us:

∫_{−∞}^{+∞} 1_B(t) μ(dt) = μ(B) = P(X ∈ B)
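Here is a Monte Carlo sketch of this identity for a standard normal X and the measurable set B = [0, 1] (an illustrative example with an arbitrary tolerance): averaging the indicator over samples of X should recover μ(B) = P(X ∈ B):

```python
import random
from statistics import NormalDist

random.seed(1)
X_samples = [random.gauss(0, 1) for _ in range(100_000)]

# Indicator function of the measurable set B = [0, 1].
indicator = lambda t: 1.0 if 0.0 <= t <= 1.0 else 0.0

# Left side: expectation of 1_B(X), estimated by averaging over samples of X.
expectation = sum(indicator(x) for x in X_samples) / len(X_samples)

# Right side: mu(B) = P(X in B), from the standard normal CDF.
mu_B = NormalDist(0, 1).cdf(1.0) - NormalDist(0, 1).cdf(0.0)

print(abs(expectation - mu_B) < 0.01)  # the two sides agree
```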

Note that in Chapter 8, we encountered another change of variables theorem from probability, which relates the probability distribution of a random variable with the probability distribution of a deterministic function of it, using the determinant of the Jacobian of this function transformation.

Next Steps in Rigorous Probability Theory

The next step in rigorous probability theory is to prove the famous inequalities (Markov, Chebyshev, Cauchy-Schwarz, Jensen’s), introduce sums and products of random variables, the laws of large numbers, and the central limit theorem. Then we move to sequences of random variables and limit theorems.

Limit theorems

If we have a sequence of random variables that converges to some limit random variable, does it follow that the expectations of the sequence converge to the expectation of the limit? In integral language, when can we exchange the limit and the integral?

This is when we prove the monotone convergence, the bounded convergence, Fatou’s lemma, the dominated convergence, and the uniformly integrable convergence theorems.

Finally we consider double or higher integrals, and conditions on when it is OK to flip integrals. Fubini’s theorem answers that, and we can apply it to give a convolution formula for the distribution of a sum of independent random variables.
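Here is a discrete instance of that convolution formula (an illustrative sketch, not from the book): the distribution of the sum of two independent fair dice is the convolution of the two individual distributions:

```python
from fractions import Fraction

die = [Fraction(1, 6)] * 6  # P(face = 1), ..., P(face = 6)

def convolve(p, q):
    """Distribution of the sum of two independent discrete random variables."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

two_dice = convolve(die, die)  # index k holds P(sum = k + 2)
print(two_dice[5])             # P(sum = 7) = 1/6
assert sum(two_dice) == 1      # still a probability distribution
```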

The Universality Theorem for Neural Networks

Rigorous measure theory (probability theory) helps us prove theorems for neural networks, which is an up-and-coming subfield of mathematics, aiming to provide theoretical grounds for many empirical AI successes.

The universality theorem for neural networks is a starting point. We have referred to it multiple times in this book. Here’s the statement:

For any continuous function f on a compact set K, there exists a feedforward neural network, having only a single hidden layer, which uniformly approximates f to within an arbitrary ϵ > 0 on K.
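The statement can be illustrated (this is a sketch in the spirit of the constructive proofs, not a proof itself, and all the parameter values are arbitrary choices) by hand-building a single-hidden-layer network out of pairs of steep sigmoids: each pair forms an approximate indicator "tower" over a small subinterval, weighted by the value of f there:

```python
import math

# Target continuous function on the compact set K = [0, 3].
f = math.sin

def sigmoid(z):
    # Numerically stable logistic sigmoid.
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))

# Hidden layer: n pairs of steep sigmoids. Each pair
# sigmoid(k(x - a)) - sigmoid(k(x - b)) approximates the indicator of [a, b];
# weighting it by f at the midpoint stacks "towers" that trace f.
n, k = 160, 3000.0
left, right = -0.1, 3.1  # cover slightly more than K to avoid edge effects
width = (right - left) / n

def net(x):
    total = 0.0
    for i in range(n):
        a = left + i * width
        b = a + width
        total += f(a + width / 2) * (sigmoid(k * (x - a)) - sigmoid(k * (x - b)))
    return total

# Uniform (sup-norm) error over a fine grid of K.
grid = [3.0 * i / 600 for i in range(601)]
err = max(abs(net(x) - f(x)) for x in grid)
print(err < 0.05)  # the hand-built network is uniformly close to f on K
```

Shrinking the tower width (larger n) drives the uniform error toward any desired ϵ, which is exactly what the theorem promises.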

This web page has a nice and easy-to-follow proof.

Summary and Looking Ahead

In this chapter, we surveyed concepts in probability that are important for AI, machine learning, and data science. We zipped through topics such as causal modeling, paradoxes, large random matrices, stochastic processes, and reinforcement learning in AI.

Often when we learn about probability, we fall into the frequentist versus the objectivist positions regarding the definitions and the overall philosophy surrounding uncertainty. The following is a neat description of each viewpoint:

A frequentist position

Probabilities can only come from experiments and observing the results of repeated trials.

An objectivist position

Probabilities are real aspects of the universe: a real inclination or natural tendency to behave in a particular way. For example, a fair coin’s propensity to turn up heads 50% of the time is an intrinsic property of the fair coin itself.

A frequentist is then only attempting to measure these natural inclinations via experiments. Rigorous probability theory unifies disparate views of probability. We swiftly introduced rigorous probability theory and established that it is in essence the same as measure theory in real analysis. We ended with the universal approximation theorem for neural networks.

We leave this chapter with a perfectly fitting tweet from Yann LeCun, which happens to touch on every topic we covered in this chapter:

I believe we need to find new concepts that would allow machines to:

  • Learn how the world works by observing like babies.

  • Learn to predict how one can influence the world through taking actions.

  • Learn hierarchical representations that allow long-term predictions in abstract representation spaces.

  • Properly deal with the fact that the world is not completely predictable.

  • Enable agents to predict the effects of sequences of actions so as to be able to reason and plan.

  • Enable machines to plan hierarchically, decomposing a complex task into subtasks.

  • All of this in ways that are compatible with gradient-based learning.

Chapter 12. Mathematical Logic

Humans bend the rules.

H.

Historically in the AI field, logic-based agents come before machine learning and neural network–based agents. The reason we went over machine learning, neural networks, probabilistic reasoning, graph representations, and operations research before logic is that we want to tie it all into one narrative of reasoning within an agent, as opposed to thinking of logic as old and neural networks as modern. We want to view the recent advancements as enhancing the way a logical AI agent represents and reasons about the world. A good way to think about this is similar to enlightenment: an AI agent used to reason using the rigid rules of a handcoded knowledge base and handcoded inference rules to make inferences and decisions, then suddenly it gets enlightened and becomes endowed with more reasoning tools, networks, and neurons that allow it to expand both its knowledge base and inference methods. This way, it has more expressive power and can navigate more complex and uncertain situations. Moreover, combining all the tools would allow an agent the option to sometimes break the rules of a more rigid logic framework and employ a more flexible one, depending on the situation, just like humans. Bending, breaking, and even changing the rules are distinctive human attributes.

The dictionary meaning of the word logic sets the tone for this chapter and justifies its progression.

Logic

A framework that organizes the rules and processes used for sound thinking and reasoning. It is a framework that lays down the principles of validity under which to conduct reasoning and inference.

The most important words to pay attention to in this definition are framework and principles for inference. A logic system codifies within an agent the principles that govern reliable inference and correct proofs. Designing agents that are able to gather knowledge, reason logically with a flexible logic system that accommodates uncertainty about the environment that they exist in, and make inferences and decisions based on this logical reasoning lies at the heart of artificial intelligence.

We discuss the various systems of mathematical logic that we can program into an agent. The goal is to give the AI agent the ability to make inferences that enable it to act appropriately. These logical frameworks require knowledge bases of varying sizes to accompany their inference rules. They also have varying degrees of expressive and deductive power.

Various Logic Frameworks

For each of the different logical frameworks (propositional, first order, temporal, probabilistic, and fuzzy) that we are about to highlight in this chapter, we will answer two questions about how they operate within an agent endowed with them:

  1. What objects exist in the agent’s world? Meaning, how does the agent perceive the composition of its world?

  2. How does the agent perceive the objects’ states? Meaning, what values can the agent assign to each object in its world under the particular logic framework?

It is easy to think about this if we liken our agent to an ant and how it experiences the world. Because of the ant’s predetermined framework of perception and allowed movements, the ant experiences the world, along with its curvature, as two-dimensional. If the ant gets enhanced and endowed with a more expressive framework of perception and allowed movements (for example, wings), it will experience the three-dimensional world.

Propositional Logic

Here are the answers to our agent questions:

What objects exist in the agent’s world?

Simple or complex statements, called propositions, hence the name propositional logic.

How does the agent perceive the objects’ states?

True (1), false (0), or unknown. Propositional logic is also called Boolean logic because the objects in it can only assume two states. Paradoxes in propositional logic are statements that cannot be classified as true or false according to the logic framework’s truth table.

These are examples of statements and their states:

  • It is raining (can take true or false states).

  • The Eiffel Tower is in Paris (always true).

  • There is suspicious activity in the park (can take true or false states).

  • This sentence is false (paradox).

  • I am happy and I am sad (always false, unless you ask my husband).

  • I am happy or I am sad (always true).

  • If the score is 13, then the student fails (truth depends on failing thresholds, so we need a statement in the knowledge base that says: all students with a score below 16 fail, and set its value at true).

  • 1 + 2 is equivalent to 2 + 1 (always true within an agent endowed with arithmetic rules).

  • Paris is romantic (in propositional logic this has to be either true or false, but in fuzzy logic it can assume a value on a zero-to-one scale, for example, 0.8, which corresponds better to the way we perceive our world: on a scale as opposed to absolutes. Of course, I would assign the value true for this statement if I am programming an agent and confined to propositional logic, but someone who hates Paris would assign false. Oh well).

The objects in a propositional logic’s world are simple statements and complex statements. We can form complex statements from simple ones using five allowed operators: not (negation), and, or, implies (which is the same as if then), and equivalent to (which is the same as if and only if).

We also have five rules to determine whether a statement is true or false:

  1. The negation of a statement is true if and only if the statement is false.

  2. statement 1 and statement 2 is true if and only if both statement 1 and statement 2 are true.

  3. statement 1 or statement 2 is true if and only if either statement 1 or statement 2 is true (or if both are true).

  4. statement 1 implies statement 2 is true except when statement 1 is true and statement 2 is false.

  5. statement 1 is equivalent to statement 2 if and only if statement 1 and statement 2 are both true or both false.

We can summarize these rules in a truth table accounting for all the possibilities for the states of statement 1 and statement 2 and for their joining using the five allowed operators. In the following truth table, we use S 1 for statement 1 and S 2 for statement 2 to save space:

S 1 | S 2 | not S 1 | S 1 and S 2 | S 1 or S 2 | S 1 implies S 2 | S 1 equivalent to S 2
----|-----|---------|-------------|------------|-----------------|----------------------
F   | F   | T       | F           | F          | T               | T
F   | T   | T       | F           | T          | T               | F
T   | F   | F       | F           | T          | F               | F
T   | T   | F       | T           | T          | T               | T

We can compute the truth of any complex statement using this truth table by simple recursive evaluation. For example, if we are in a world where S 1 is true, S 2 is false, and S 3 is true, then we have the statement:

  • not S 1 and (S 2 or S 3) → F and (F or T) = F and T = F
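This recursive evaluation is straightforward to sketch in code (an illustrative representation, not from the book: a statement is either a Boolean or a tuple of an operator and substatements):

```python
# Each statement is either a bool or a tuple ("not", s) / (op, s1, s2).
def evaluate(stmt):
    if isinstance(stmt, bool):
        return stmt
    op = stmt[0]
    if op == "not":
        return not evaluate(stmt[1])
    a, b = evaluate(stmt[1]), evaluate(stmt[2])
    return {
        "and": a and b,
        "or": a or b,
        "implies": (not a) or b,   # true except when a is true and b is false
        "equivalent": a == b,
    }[op]

# The worked example: S1 = T, S2 = F, S3 = T.
S1, S2, S3 = True, False, True
print(evaluate(("and", ("not", S1), ("or", S2, S3))))  # False
```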

To be able to reason and prove theorems using propositional logic, it is helpful to establish logical equivalences, meaning statements that have the exact same truth tables so they can replace each other in a reasoning process. The following are some examples of logical equivalences:

  • Commutativity of and: S 1 and S 2 ≡ S 2 and S 1

  • Commutativity of or: S 1 or S 2 ≡ S 2 or S 1

  • Double negation elimination: not (not S 1) ≡ S 1

  • Contraposition: S 1 implies S 2 ≡ not(S 2) implies not(S 1)

  • Implication elimination: S 1 implies S 2 ≡ not(S 1) or S 2

  • De Morgan’s law: not(S 1 and S 2) ≡ not(S 1) or not(S 2)

  • De Morgan’s law: not(S 1 or S 2) ≡ not(S 1) and not(S 2)

Let’s demonstrate that S 1 implies S 2 ≡ not(S 1) or S 2 by showing that they have the same truth table, since this equivalence is not so intuitive for some people:

S 1 | not(S 1) | S 2 | not(S 1) or S 2 | S 1 implies S 2
----|----------|-----|-----------------|----------------
F   | T        | F   | T               | T
F   | T        | T   | T               | T
T   | F        | T   | T               | T
T   | F        | F   | F               | F
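Since each statement involves only finitely many simple statements, any claimed equivalence can be verified by brute-force enumeration of truth assignments. A small sketch (illustrative, with implies defined by its truth-table rule) checking this equivalence along with contraposition and De Morgan's law:

```python
from itertools import product

# "implies" by its truth-table rule: true except when a is true and b is false.
implies = lambda a, b: not (a and not b)

assignments = list(product([False, True], repeat=2))

# Implication elimination: S1 implies S2 is equivalent to not(S1) or S2.
elimination = all(implies(a, b) == ((not a) or b) for a, b in assignments)

# Contraposition: S1 implies S2 is equivalent to not(S2) implies not(S1).
contraposition = all(implies(a, b) == implies(not b, not a) for a, b in assignments)

# De Morgan: not(S1 and S2) is equivalent to not(S1) or not(S2).
de_morgan = all((not (a and b)) == ((not a) or (not b)) for a, b in assignments)

print(elimination, contraposition, de_morgan)  # True True True
```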

One example that demonstrates how logical equivalences are useful is the proof by contradiction way of reasoning. To prove that the statement S 1 implies the statement S 2 , we can assume that we have S 1 but at the same time we do not have S 2 , then we arrive at something false or absurd, which proves that we cannot assume S 1 without concluding S 2 as well. We can verify the validity of this way of proving that S 1 implies S 2 using propositional logic equivalences:

  • S 1 implies S 2 = true

  • not( S 1 ) or S 2 = true (implication elimination)

  • not(not( S 1 ) or S 2 ) = not(true)

  • S 1 不是( S 2 ) = false(德摩根和双重否定)

  • S 1 and not( S 2 ) = false (De Morgan and double negation)

We endow a propositional logic framework with rules of inference, so that we are able to reason sequentially from one statement (simple or complex) to the next and arrive at a desired goal or at a correct proof of a statement. These are some of the rules of inference that accompany propositional logic:

  • If S 1 implies S 2 is true, and we are given S 1 , then we can infer S 2 .

  • 如果 S 1 S 2 是真的,那么我们可以推断 S 1 。同样,我们也可以推断 S 2

  • If S 1 and S 2 is true, then we can infer S 1 . Similarly, we can also infer S 2 .

  • If S 1 is equivalent to S 2 , then we can infer ( S 1 implies S 2 ), and ( S 2 implies S 1 ).

  • Conversely, if ( S 1 implies S 2 ) and ( S 2 implies S 1 ), then we can infer that ( S 1 is equivalent to S 2 ).
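These inference rules can be sketched as a tiny forward-chaining loop (all fact and rule names here are hypothetical, for illustration): starting from a knowledge base of facts and implications, the agent repeatedly applies modus ponens until nothing new can be inferred:

```python
# A tiny forward-chaining sketch over a propositional knowledge base.
# Facts are strings; rules are (premises, conclusion) pairs, all assumed true.
facts = {"user_older_than_18"}
rules = [
    ({"user_older_than_18"}, "sees_ad"),
    ({"sees_ad", "interested"}, "clicks_ad"),
]

def forward_chain(facts, rules):
    """Repeatedly apply modus ponens until no new statements are inferred."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

inferred = forward_chain(facts, rules)
# "sees_ad" follows by modus ponens; "clicks_ad" does not,
# since "interested" is not in the knowledge base.
print("sees_ad" in inferred, "clicks_ad" in inferred)  # True False
```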

We finally emphasize that propositional logic does not scale to large environments and cannot efficiently capture universal relationship patterns. However, propositional logic provides the foundation of first-order logic and higher-order logic, since those build on top of propositional logic’s machinery.

From Few Axioms to a Whole Theory

The inference rules are sound. They allow us to prove only true statements, in the sense that if we start from a true statement and apply a sound inference rule to it, we arrive at a true statement. Therefore, the guarantee that sound inference rules provide is that they do not allow false statements to be inferred from true ones. We need slightly more than that guarantee.

A logical framework is complete when we are able to infer all possible true statements using only the system’s knowledge base (axioms) and its inference rules. The idea of completeness of a system is very important. In all mathematical systems, such as number theory, probability theory, set theory, or Euclidean geometry, we start with a set of axioms (Peano axioms for number theory and mathematical analysis, and probability axioms for probability theory), then we deduce theorems from these axioms using the logical rules of inference. One main question in any math theory is whether the axioms along with the rules of inference ensure its completeness and its consistency.

No first-order theory, however, has the strength to uniquely describe a structure with an infinite domain, such as the natural numbers or the real line. Axiom systems that do fully describe these two structures (that is, categorical axiom systems) can be obtained in stronger logics such as second-order logic.

Codifying Logic Within an Agent

Before moving on to first-order logic, let’s recap what we learned in the context of an AI agent endowed with propositional logic. The following process is important and will be the same for more expressive logics:

  1. We program an initial knowledge base (axioms) in the form of true statements.

  2. We program the inference rules.

  3. The agent perceives certain statements about the current state of its world.

  4. The agent may or may not have a goal statement.

  5. The agent uses the inference rules to infer new statements and to decide what to do (move to the next room, open the door, set the alarm clock, etc.).

  6. Completeness of the agent’s system (knowledge base together with the inference rules) is important here, since it allows the agent to infer any satisfiable goal statement given enough inference steps.

How Do Deterministic and Probabilistic Machine Learning Fit In?

The premise of machine learning (including neural networks) is that we do not program an initial knowledge base into the agent, and we do not program inference rules. What we program instead is a way to represent the input data, the desired outputs, and a hypothesis function that maps the input to the output. The agent then learns the parameters of the function by optimizing the objective function (loss function). Finally, the agent makes inferences on new input data using the function it learned. So in this context, the knowledge base and the rules can be separated during learning or during inference. During learning, the knowledge base is the data and the hypothesis function, the goal is minimizing the loss, and the rules are the optimization process. After learning, the agent uses the learned function for inference.

We can think of probabilistic machine learning models in exactly the same way if we replace the deterministic hypothesis function by the joint probability distribution of the features of the data. Once learned, the agent can use it for inference. For example, Bayesian networks would play a similar role for uncertain knowledge as propositional logic for definite knowledge.

First-Order Logic

Let’s answer the same questions for first-order logic:

What objects exist in the agent’s world?

Statements, objects, and relations among them.

How does the agent perceive the objects’ states?

True (1), false (0), or unknown.

Propositional logic is great for illustrating how knowledge-based agents work, and to explain the basic rules of a certain logic’s language and rules of inference. However, propositional logic is limited in what knowledge it can represent and how it can reason about it. For example, in propositional logic, the statement:

All users who are older than 18 can see this ad.

is easy to express as an implies statement (which is the same as if then), since this kind of language exists in the propositional logic framework. This is how we can express the statement as an inference in propositional logic:

Given (User older than 18 implies see the ad) and (User older than 18 = T), we can infer that (see the ad = T).

Let’s now think of a slightly different statement:

Some of the users who are older than 18 click on the ad.

Suddenly the language of propositional logic is not sufficient to express the quantity some in the statement! An agent relying only on propositional logic will have to store the whole statement, as is, in its knowledge base, then not know how to infer anything useful out of it. That is, if the agent learns that a user is indeed older than 18, it still cannot predict whether the user will click on the ad or not.

We need a language (or a logical framework) whose vocabulary includes quantifiers such as there exist and for all, so that we can write something like:

For all users who are older than 18, there exists a subset who clicks on the ad.

These two extra quantifiers are exactly what the first-order logic framework provides. This increase in vocabulary allows us to be more economical in what to store in the knowledge base, since we are able to break down the knowledge into objects and relations between them. For example, instead of storing:

All the users who are older than 18 see the ad;

Some of the users who are older than 18 click on the ad;

Some of the users who are older than 18 buy the product;

Some of the users who click on the ad buy the product;

as four separate statements in the knowledge base of an agent with only propositional logic framework (which we still don’t know how to infer anything useful from), we can store three statements in first-order logic:

For all users who are older than 18, see ad = T;

For all users with see ad = T, there exists a subset who clicks on the ad;

For all users who click on the ad, there exists a subset who buys the product.

Note that in both propositional and first-order logics, given only these statements we will not be able to infer whether a specific user who is older than 18 will click on the ad or buy the product, or even the percentage of those doing that, but at least in first-order logic we have the language to express the same knowledge more concisely, and in a way where we would be able to make some useful inferences.

The most distinctive feature of first-order logic compared to propositional logic is that it adds quantifiers such as there exist and for all to its base language, on top of the not, and, or, implies, and is equivalent to that already exist in propositional logic. This little addition opens the door to expressing objects separately from their descriptions and their relationships to each other.

The powerful thing about propositional and first-order logics is that their inference rules are independent from both the domain and its knowledge base or set of axioms. To develop a knowledge base for a specific domain, such as a math field, or circuit engineering, we must study the domain carefully, choose the vocabulary, then formulate the set of axioms required to support the desired inferences.

Relationships Between For All and There Exist

For all and there exist are connected to each other through negation. The following two statements are equivalent:

  • All users who are above 18 see the ad.

  • There exists no one above 18 who doesn’t see the ad.

In propositional logic language, these two statements translate to:

  • For all users such that user > 18 is true, see the ad is true.

  • There exists no user such that user > 18 and see the ad is false.

These are the relationships:

  • not(There exists an x such that P is true) ⟺ For all x, P is false.

  • not(For all x, P is true) ⟺ There exists an x such that P is false.

  • There exists an x such that P is true ⟺ not(For all x, P is false).

  • For all x, P is true ⟺ There exists no x such that P is false.
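Over a finite domain, these quantifier-negation equivalences can be checked directly with Python's `any` (there exists) and `all` (for all). This is a sanity check over one small domain, not a proof; the predicate is an arbitrary choice for illustration:

```python
# Checking the quantifier-negation equivalences on a finite domain.
# P(x) is a hypothetical predicate, e.g., "user x clicks on the ad".
domain = range(10)
P = lambda x: x % 2 == 0  # arbitrary predicate for illustration

# not(there exists x with P true)  <=>  for all x, P is false
assert (not any(P(x) for x in domain)) == all(not P(x) for x in domain)

# not(for all x, P is true)  <=>  there exists x with P false
assert (not all(P(x) for x in domain)) == any(not P(x) for x in domain)

print("equivalences hold on this domain")
```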

We cannot leave this section without appreciating the expressive power we gained by moving to first-order logic. This logic framework is now sufficient for such assertions and inferences to make sense:

Universal approximation theorem for neural networks

Roughly speaking, the universal approximation theorem asserts that for all continuous functions, there exists a neural network that can approximate the function as closely as we wish. Note that this does not tell us how to construct such a network, it only asserts its existence. Still, this theorem is powerful enough to make us unsurprised about the success of neural networks in approximating all kinds of input to output functions in all kinds of applications.
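In one dimension we can even exhibit such a network by hand: a one-hidden-layer ReLU network can represent any continuous piecewise-linear interpolant, and adding knots drives the approximation error of a smooth function down. The sketch below is my own construction (assuming f(x) = x² on [0, 1] as the target), not the general proof of the theorem:

```python
import numpy as np

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU net that interpolates f at `knots`.

    Returns g(x) = f(knots[0]) + sum_i c_i * relu(x - knots[i]),
    the piecewise-linear interpolant of f on the knots: the c_i are
    the slope *changes* of the interpolant at each knot.
    """
    ys = np.array([f(k) for k in knots])
    slopes = np.diff(ys) / np.diff(knots)   # slope on each segment
    coeffs = np.diff(slopes, prepend=0.0)   # slope changes at the knots
    def g(x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(x[..., None] - knots[:-1], 0.0)  # ReLU units
        return ys[0] + hidden @ coeffs
    return g

f = lambda x: x**2
knots = np.linspace(0.0, 1.0, 11)   # 10 segments of width h = 0.1
g = relu_interpolant(f, knots)

xs = np.linspace(0.0, 1.0, 1001)
err = np.max(np.abs(g(xs) - f(xs)))
print(err)  # ~0.0025, i.e., f'' * h^2 / 8; halving h quarters the error
```

Doubling the number of hidden units (knots) cuts the error by a factor of four here, which is the constructive flavor behind the existence claim.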

Inferring relationships

Parents and children have inverse relationships to each other: if Sary is the child of Hala, then Hala is the mother of Sary. Moreover, the relationship is in one direction: Sary cannot be the mother of Hala. In first-order logic, we can assign two functions indicating the relationships: mother of and child of, variables that can be filled in by Hala and Sary or any other mother and child, and a relationship between the functions that holds for all their input variables:

For all x, y, if mother(x, y) = T then mother(y, x) = F;

and for all x, y, mother(x, y) ⟺ child(y, x).

Now if we equip an agent with this knowledge and tell it that Hala is the mother of Sary, or mother(Hala, Sary) = T, then it will be able to answer queries like:

  • Is Hala the mother of Sary? T

  • Is Sary the mother of Hala? F

  • Is Sary the child of Hala? T

  • Is Hala the child of Sary? F

  • Is Laura the mother of Joseph? Unknown

Note that we will have to store each statement separately in a propositional logic world, which is outrageously inefficient.
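This query answering can be sketched with a tiny knowledge base that stores only `mother` facts and derives `child` answers and negations from the two rules, returning "Unknown" for anything it cannot derive. A hypothetical illustration, not the book's code:

```python
def make_kb(mother_facts):
    """mother_facts: set of (x, y) pairs meaning mother(x, y) = T."""
    def query(relation, x, y):
        if relation == "mother":
            if (x, y) in mother_facts:
                return "T"
            if (y, x) in mother_facts:  # mother(x,y) => not mother(y,x)
                return "F"
            return "Unknown"
        if relation == "child":         # mother(x,y) <=> child(y,x)
            if (y, x) in mother_facts:
                return "T"
            if (x, y) in mother_facts:  # x is y's mother, so not y's child
                return "F"
            return "Unknown"
        return "Unknown"
    return query

query = make_kb({("Hala", "Sary")})
print(query("mother", "Hala", "Sary"))     # T
print(query("mother", "Sary", "Hala"))     # F
print(query("child", "Sary", "Hala"))      # T
print(query("child", "Hala", "Sary"))      # F
print(query("mother", "Laura", "Joseph"))  # Unknown
```

One stored fact plus two general rules answers all five queries, which is exactly the economy that first-order logic buys over storing every statement separately.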

Probabilistic Logic

What objects exist in the agent’s world?

Statements.

How does the agent perceive the objects’ states?

A probability value between 0 and 1 that a statement is true.

Probability is the extension of first-order logic that allows us to quantify our uncertainty about the truth of a statement. Rather than asserting whether a statement is true or false, we assign to the degree of our belief in the truth of the statement a score between zero and one. Propositional and first-order logics provide a set of inference rules that allow us to determine the truth of some statements, given the assumption that some other statements are true. Probability theory provides a set of inference rules that allow us to determine how likely it is that a statement is true, given the likelihood of truth of other statements.

This extension to dealing with uncertainty results in a more expressive framework than first-order logic. The axioms of probability allow us to extend traditional logic truth tables and inference rules. For example, P(A) + P(not (A)) = 1: if A is true, then P(A) = 1 and P(not A) = 0, which is consistent with first-order logic about a statement and its negation.

Viewing probability theory as a natural extension of first-order logic is satisfying to a mind that needs to connect things together as opposed to viewing them as disparate things. Viewing it this way also naturally leads to Bayesian reasoning about data, since we update an agent’s prior distribution as we gather more knowledge and make better inferences. This binds all our subjects together in the most logical way.

Fuzzy Logic

What objects exist in the agent’s world?

Statements with a degree of truth between 0 and 1.

How does the agent perceive the objects’ states?

A known interval value.

The worlds of propositional and first-order logic are black and white, true or false. They allow us to start with true statements and infer other true statements. This setting is perfect for mathematics, where everything can either be right or wrong (true or false), or for a video game with very clear boundaries for its sims. In the real world, many statements can be vague about whether they are fully true (1) or fully false (0), meaning they exist on a scale of truth as opposed to at the edges: Paris is romantic; she is happy; the movie The Dark Knight is good. Fuzzy logic allows this and assigns values to statements between 0 and 1 as opposed to strict 0 or strict 1: Paris is romantic (0.8); she is happy (0.6); the movie The Dark Knight is good (0.9).

How do we make inferences in a vague world where truth comes on a sliding scale? It definitely is not as straightforward as inference in true-and-false worlds. For example, how true is the statement “Paris is romantic and she is happy,” given the previous truth values? We need new rules to assign these values, and we need to know the context, or a domain. Another option is word vectors, which we discussed in Chapter 7. These vectors carry the meaning of words in different dimensions, so we can compute the cosine similarity between the vector representing the word Paris and the vector representing the word romantic, and assign that as the truth value of the statement Paris is romantic.
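One common (but not the only) set of rules for combining fuzzy truth values is the Zadeh operators: min for and, max for or, and 1 − t for negation. A minimal sketch, using the illustrative truth values from the text:

```python
# Zadeh fuzzy-logic operators: one standard choice among several.
def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)
def f_not(a):    return 1.0 - a

paris_romantic = 0.8  # truth value of "Paris is romantic"
she_happy = 0.6       # truth value of "she is happy"

print(f_and(paris_romantic, she_happy))       # 0.6: "Paris is romantic and she is happy"
print(f_or(paris_romantic, she_happy))        # 0.8
print(round(f_not(paris_romantic), 2))        # 0.2
```

Note that under these operators a statement and its negation can both have positive degree, which is the departure from classical logic discussed below.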

Note that the degree of belief in probability theory is not the same as the scale of truth in fuzzy logic. In probabilistic logic, the statements themselves are unambiguous. What we want to infer is the probability that the unambiguous statement is true. Probability theory does not reason about statements that are not entirely true or false. We do not calculate the probability that Paris is romantic, but we calculate the probability that a person, when randomly asked whether Paris is romantic, would answer true or false.

One interesting thing about fuzzy logic is that it kicks two principles present in other logics to the curb: the principle that if a statement is true, then its negation is false; and the principle that two contradictory statements cannot be true at the same time. This actually opens the door to inconsistency and open universe. In a way, fuzzy logic doesn’t attempt to correct vagueness; instead, it embraces it and leverages it to allow functioning in a world where the boundaries are unclear.

Temporal Logic

There are other types of special purpose logics, where certain objects, such as time in this section, are given special attention, having their own axioms and inference rules, because they are central to the knowledge that needs to be represented and the reasoning about it. Temporal logic puts time dependence and the axioms and inference rules about time dependence at the forefront of its structure, as opposed to adding statements that include time information to the knowledge base. In temporal logic, statements or facts are true at certain times, which could be time points or time intervals, and these times are ordered.

What objects exist in the agent’s world?

Statements, objects, relations, times.

How does the agent perceive the objects’ states?

True (1), false (0), or unknown.

In temporal logic, we can represent statements such as:

  • The alarm goes off when it is 7:00 a.m.

  • Whenever a request is made to a server, access is eventually granted, but it can never be granted to two simultaneous requests.
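Over a finite trace of time-stamped events, properties like the second one can be checked directly. This is a hypothetical sketch of my own (a model checker for real temporal logic is far more involved): it verifies that every request is eventually followed by a grant, and that no two grants are simultaneous.

```python
def satisfies(trace):
    """trace: time-sorted list of (time, event) pairs,
    with event in {"request", "grant"}.

    Checks two temporal properties:
      1. every request is eventually followed by a grant;
      2. no two grants happen at the same time.
    """
    grant_times = [t for t, e in trace if e == "grant"]
    eventually = all(
        any(g > t for g in grant_times)
        for t, e in trace if e == "request"
    )
    no_simultaneous = len(grant_times) == len(set(grant_times))
    return eventually and no_simultaneous

ok = [(1, "request"), (2, "grant"), (5, "request"), (7, "grant")]
bad = [(1, "request"), (2, "request"), (3, "grant"), (3, "grant")]
print(satisfies(ok))   # True
print(satisfies(bad))  # False: two grants at time 3
```

The ordering of times is doing the work here, which is exactly what temporal logic builds into its axioms rather than into the knowledge base.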

Comparison with Human Natural Language

We spent the whole chapter going through logical systems that are able to express knowledge that humans’ natural languages seem to do effortlessly. I just wrote a whole book about math using only the English language, as opposed to any other technical language. How do we do it? How do humans represent and expand their knowledge base, and what rules does natural language use for representation and reasoning that it is able to be so expressive? Moreover, the particular natural language used is not important: any multilingual speaker knows the thought but not necessarily which particular language they are using to express that thought. There is an internal nonverbal representation for what people know or want to express. How does it work, and how can we unlock its secrets and give them to our machines?

Similar to human language, if we represent the same knowledge in two different formal logics, then we can infer the same facts (assuming the logics’ inference rules are complete). The only difference would be which logic framework provides an easier route for inference.

That said, human natural language allows for ambiguity on many occasions and cannot make absolute mathematical assertions without the formality of mathematics and the formal logic it employs. We cannot ask a human who has no access to a GPS system to predict the exact time it takes to drive from DC to NYC on a specific day, but we can ask a GPS machine for that kind of accuracy.

Machines and Complex Mathematical Reasoning

Mathematical reasoning is the distillation of human logic. When we prove a theorem using mathematical reasoning, we get one step closer to the universal truth. Teaching machines to prove mathematical theorems—and even more ambitiously, generating new ones—requires navigating infinite search spaces and symbolic reasoning. Once again, neural networks are proving useful for advancing intelligent machines. Researchers at Meta, Vrije Universiteit Amsterdam, and CERMICS École des Ponts ParisTech used a combination of deep learning, online training, transformers (large language models), and reinforcement learning for automated mathematical theorem proving. Their paper, HyperTree Proof Search for Neural Theorem Proving (2022), presents state-of-the-art results in this field.

Summary and Looking Ahead

An AI agent endowed with various types of logic can express knowledge about the world, reason about it, answer queries, and make inferences that are allowed within the boundaries of these logics.

We discussed various logic frameworks, including propositional logic, first-order logic, probabilistic logic, fuzzy logic, and temporal logic.

The next natural questions would be: what content should go into an agent’s knowledge base? And how to represent facts about the world? In what framework should knowledge be represented and inference made?

  • Propositional logic?

  • First-order logic?

  • Hierarchical task networks for reasoning about plans?

  • Bayesian networks for reasoning with uncertainty?

  • Causal diagrams and causal reasoning where an agent is allowed to selectively break the rules of logic?

  • Markov models for reasoning over time?

  • Deep neural networks for reasoning about images, sounds, or other data?

Another possible next step is to dive deeper into any of the logic frameworks that we discussed, learning their inference rules and existing algorithms for inference, along with their strengths, weaknesses, and which kinds of knowledge bases they apply to. A recurring theme in these studies is investigating inference rules that provide a complete proof system, meaning a system where the axioms or the knowledge base along with the rules allow one to prove all possible true statements. Such rules include the resolution inference rule for propositional logic and the generalized resolution inference rule for first-order logic, which work for special types of knowledge bases. These are all important for theory (proving mathematical theorems) and for technology (verifying and synthesizing software and hardware). Finally, some logics are strictly more expressive than others, in the sense that some statements that we can represent in the more expressive logic cannot be expressed by any finite number of statements using the language of the less expressive logic. For example, higher-order logic (which we did not discuss in this chapter) is strictly more expressive than first-order logic (which we did discuss in this chapter, and which is powerful enough to support entire math theories).

Chapter 13. Artificial Intelligence and Partial Differential Equations

I want to model the whole world.

H.

The first scene in the movie Top Gun: Maverick (2022) shows Maverick (Tom Cruise) manning an experimental military aircraft and pushing it to 10 times the speed of sound (10 Mach) before losing its stability at around 10.2 Mach. The fastest nonfictional manned aircraft so far can reach 6.7 Mach (Figure 13-1). Real speed or unreal (yet), it is mesmerizing to watch physics, math, and engineering come together to put these planes in the air, especially with their spectacular midair maneuvering.

Figure 13-1. The fastest manned aircraft to date (image source)

These are a few of the partial differential equations (PDEs) that come to mind while watching Maverick’s awesome dogfight and 10 Mach scenes:

The wave equation for wave propagation

Think of the speed of sound, the propagation of a sound wave in the air, and the variations of the speed of sound at different altitudes due to the variations in temperature and air density.

Navier-Stokes equations for fluid dynamics

Think of the fluid flow, air tunnels, and turbulence.

The G-equation for combustion

Think of the combustion in the aircraft’s engine and the flames coming out of the aircraft’s exhausts.

Material elasticity equations

Think of the aircraft wing panel, the lift force, and the buckling of the wing panel (the out-of-plane movement that happens under compression; see Figure 13-2) caused by loading, which in turn reduces the load-carrying capabilities of the wing. When load-carrying capabilities fall below the design limits, failure happens.

Figure 13-2. Buckling on an aircraft (image source)

PDE simulations also come to mind. Think of the flight path simulation and the crew chatting with Maverick as they watch his flight unfold in real time on their computer screens.

The list goes on. Are we claiming here that we made aircraft fly because we wrote down and solved PDEs? No. Aviation museums tell the story of the Wright brothers, their experiments, and the evolution of the aviation industry. Science and experimentation go hand in hand. What we want to claim instead is that we can invent, improve, and optimize all kinds of designs because of differential equations and math.

What Is a Partial Differential Equation?

A PDE is an equation, which means a lefthand side is equal to a righthand side, that involves a function of several variables along with any of its partial derivatives. A partial derivative of a function with respect to a certain variable measures the rate of change of the function with respect to that variable. Ordinary differential equations (ODEs) are those that involve functions of only one variable, such as only time, only space, etc. (as opposed to several variables), and their derivatives. A dynamic system is a greatly important ODE, describing the evolution in time of the state of a system that we care for, such as a system of particles, or the state of a customer in a business setting. The ODE involves one derivative in time of the state of the system, and the dynamics are prescribed as a function of the system state, system physical parameters, and time. The ODE looks like dx(t)/dt = f(x(t), a(t), t). We will visit dynamic systems multiple times in this chapter. Most of the time, if we are able to transform a PDE into a system of ODEs, or maybe into a dynamical system, it is more or less solved.
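Once the dynamics f in dx(t)/dt = f(x(t), a(t), t) are known, the system can be integrated numerically; the simplest scheme is forward Euler. A minimal sketch of my own, using exponential decay dx/dt = −x (where the exact solution e^(−t) is known) so the approximation can be checked:

```python
import math

def euler(f, x0, t0, t1, n):
    """Forward Euler for dx/dt = f(x, t): n equal steps from t0 to t1."""
    h = (t1 - t0) / n
    x, t = x0, t0
    for _ in range(n):
        x = x + h * f(x, t)  # step along the prescribed dynamics
        t = t + h
    return x

# dx/dt = -x with x(0) = 1; exact solution is x(t) = exp(-t).
approx = euler(lambda x, t: -x, 1.0, 0.0, 1.0, 1000)
print(approx, math.exp(-1))  # approximation vs exact value at t = 1
```

With 1,000 steps the error at t = 1 is on the order of 10⁻⁴; halving the step size roughly halves it, which is the first-order accuracy of the scheme.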

Nature gave us neither the deterministic functions nor the joint probability distributions that it uses to produce the world that we observe around us and can accurately measure. Until now, it has kept those secret. It did, however, give us ways to measure, assess, or make laws about how things change relative to each other, which is exactly what partial differential equations represent. Because the ways things change are nothing but derivatives.

The goal of solving a PDE is to undo the differential operator so that we can recover the function without any derivatives. So we search for an exact or an approximate inverse (or pseudoinverse) of the differential operator that the PDE represents. Integrals undo derivatives, so solution representations of PDEs often involve the integrals of some kernel functions against the input data of a PDE (its parameters, initial and/or boundary conditions). We will elaborate on this as the chapter evolves.

People usually classify ODEs and PDEs into types. My take on this is that we should not confuse ourselves with classifications unless we happen to be personally working with these special ODEs or PDEs and their solutions happen to have a direct and immediate impact on the future of humanity. When in this chapter you encounter a certain type, such as nonlinear parabolic or backward stochastic, accept the name, then move directly to understanding the point that I am trying to make. Don’t even try to google these terms. It will be similar to googling your symptoms and finding that you will die tomorrow. Consider yourself warned.

Modeling with Differential Equations

Differential equations model countless phenomena in the real world: air turbulence, the motions of galaxies, the behavior of materials at the nanoscale, pricing financial instruments, games with adversaries and multiple players, and population mobility and growth. Typical courses on PDEs skip the modeling step, so the PDEs that we end up studying seem to come out of the blue, but that is not the case. Where PDEs come from is as important as trying to analyze them and solve them. Usually, PDEs express some conservation laws, such as conservation of energy, mass, momentum, etc., as they relate to our particular application. Many PDEs are an expression of a conservation statement that looks like:

rate of change of a quantity in time = gains − losses

Now, when we have a bounded domain, the PDE works in the interior of the domain, but we need to accompany it with boundary conditions that tell us exactly what is happening at the boundary of the domain. If the domain is unbounded, then we need far field conditions that tell us what is happening as x → ∞. We write these conditions using limits notation. If the PDE has derivatives in time, then we need some initial time conditions, or end time conditions. How many of these conditions we need depends on the order of the PDE. Think of these as how many equations we need to solve for how many unknowns. The unknowns are the integration constants of the PDEs. When we solve a PDE, we seek information about the function, given information about its derivatives. To get rid of these derivatives and recover the function, we must integrate the PDE, getting integration constants along the way. We need the boundary and/or far field conditions to solve for these constants.

Models at Different Scales

Realistic models that mimic nature faithfully need to account for all the important variables along with their interactions, sometimes at varying scales of space and time. Some work goes into writing the equations for the mathematical models. Once formulated, they are elegant, condensing a whole wealth of information into a few lines of equations. These equations involve functions, their derivatives, and the model’s parameters, and are usually harder to solve than to formulate. Moreover, if two models describe the same phenomenon at different scales, say one on the atomistic scale (rapidly wiggling molecules) and another at a bigger scale, say at the microscopic or macro scale (the one we observe), then the two models’ equations would look very different, and they may even be relying on physical laws from different fields of science. Think, for example, about describing the motion of gases at the molecular level (particle velocity, position, forces acting on it, etc.) and how to relate that to the thermodynamics of the gaseous system observed at the macroscopic scale. Or think of the ways atoms bond together to form crystalline structures, and how those structures translate into material properties, such as conductivity, permeability, brittleness, etc. The natural question is then, can we reconcile such models when each operates and is successful to some degree at a different scale? More precisely, if we take the limit of one model to the regime of the other, would we get the same thing? These are the types of questions that analysts address. Reconciling different scale models validates them and unifies different areas of math and science.

The Parameters of a PDE

The PDEs that we write down for a model usually involve parameters. These parameters have to do with the properties of the physical system that we are modeling. For example, for the heat equation:

u_t ( x , t ) = α Δ u ( x , t )

the parameter α is the diffusion coefficient, a physical constant that depends on the properties of the diffusing substance and of the medium it is diffusing into. We usually get these from reference tables obtained from experiments. Their values are very important for engineering purposes. When our equations model reality, we must use parameter values that are derived from real experimental or observational data. But experimental and observational data are usually noisy, and have missing values, unexplained outliers, and all kinds of rough seas for mathematical models. Many times we don’t even have experimental values for the parameters that are in our equations. The experiments can be expensive (think Large Hadron Collider) or even impossible. So one must learn the parameter values in indirect ways, based on accessible combinations of experimental, observational, and computer-simulated values of some other variables. Historically, many of these parameter values were hand tuned to fit some desired outcome, which is not good! We should have clear justifications for the choices of the parameter values that go into simulations. We will see how machine learning helps PDEs here, learning parameter values from data.
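As a toy sketch of learning a parameter from noisy data (all of the numbers below are assumptions, not values from any real experiment): for the heat equation on [0, 1] with zero boundary values, u(x, t) = e^(−απ²t) sin(πx) is an exact solution, so the decay rate observed at a single probe point reveals α through a linear fit on the logarithm of the measurements.

```python
import numpy as np

# Toy parameter recovery: u(x, t) = exp(-alpha*pi^2*t) * sin(pi*x) solves
# u_t = alpha * u_xx on [0, 1] with zero boundary values. The true alpha,
# the 1% noise level, and the sample times are all illustrative assumptions.
rng = np.random.default_rng(0)
alpha_true = 0.1
t = np.linspace(0.1, 1.0, 50)
u = np.exp(-alpha_true * np.pi**2 * t) * np.sin(np.pi * 0.5)  # probe at x = 0.5
u_noisy = u * (1 + 0.01 * rng.standard_normal(t.size))        # noisy measurements

# log u(0.5, t) = -alpha * pi^2 * t, so the slope of a linear fit gives alpha
slope = np.polyfit(t, np.log(u_noisy), 1)[0]
alpha_est = -slope / np.pi**2
```

Even with 1% measurement noise, the estimate lands close to the true α = 0.1; real parameter learning faces the same idea with far messier data and models.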

Changing One Thing in a PDE Can Be a Big Deal

If you had a PDE class in college, think back to the simplest equation that you studied, perhaps the heat diffusion equation on a rod, such as those in Figure 13-3. If you did not study PDEs, do not worry about its details. The formula of the heat equation is:

u_t ( x , t ) = α Δ u ( x , t )

Here, u ( x , t ) measures the temperature at the point x in the rod and at time t, and the operator Δ is the second derivative in x (so Δ u ( x , t ) = u xx ( x , t ) ), since a rod is only one-dimensional if we ignore its thickness. In higher dimensions, the operator Δ is the sum of the second derivatives in each of the dimensions.
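To make the rod equation concrete, here is a minimal sketch of evolving it numerically: march forward in time in small steps and replace Δ by its discrete analog (both ideas are developed later in this chapter). The initial temperature profile, α, and the step sizes are illustrative assumptions.

```python
import numpy as np

# Heat equation u_t = alpha * u_xx on a rod [0, 1], ends held at temperature 0.
# Forward Euler in time, central differences in space. The initial profile
# sin(pi*x) is an assumption; its exact solution decays like exp(-alpha*pi^2*t).
alpha, n, dt, steps = 1.0, 50, 1e-4, 500
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
u = np.sin(np.pi * x)

assert alpha * dt / h**2 <= 0.5  # stability condition for this explicit scheme
for _ in range(steps):
    u[1:-1] += alpha * dt / h**2 * (u[2:] - 2 * u[1:-1] + u[:-2])
```

After steps*dt = 0.05 units of time, the peak temperature has decayed to roughly e^(−π²·0.05) ≈ 0.61, matching the exact solution.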

Now let’s change the domain from a rod to a weird-shaped plate: not a square or a circle or an ellipse, but something irregular; for example, compare Figure 13-3 with Figure 13-4. The formula for the true solution (which we learn in an introductory PDE class) that works for the rod does not work for the weird plate anymore. It gets even worse. Not only do we lose access to the analytical solution by changing the domain, but when we try to numerically solve the differential equation on the new domain, the new geometry suddenly complicates matters.

Now we have to find a discrete mesh that accurately portrays the shape of the new domain with all its details, then we have to compute a numerical solution on top of that mesh that satisfies the equation in the interior of the domain, and satisfies the boundary conditions along the weird-looking boundary.

Figure 13-3. Studying heat diffusion on a rod is easy, both analytically and numerically (for those who study PDEs)
Figure 13-4. Studying heat diffusion on an irregular geometry is not easy

This is normal in PDEs: Change one tiny thing and suddenly all the mathematical methods that we learned may not apply anymore. Such changes include:

  • Change the shape of the domain

  • Change the types of the boundary conditions

  • Introduce a space or time dependence in the coefficients (the parameters)

  • Introduce nonlinearities

  • Introduce terms with more derivatives (higher order)

  • Introduce more variables (higher dimension)

This frustrating aspect turns off many students from specializing in PDEs (no one wants to be an expert at only one equation, which could be very removed from modeling reality to start with). We don’t want to be turned off. We want to see the big picture.

Natural phenomena are wonderfully varied, so we have to accept the variations in the PDEs and their solution methods as part of our quest to understand and predict nature. Moreover, PDEs are a large and old field. A lot of progress has been made on unifying methods for many families of both linear and nonlinear partial differential equations, and a lot of powerful analysis has been discovered along the way. The status quo is that PDEs are a very useful field that does not have, and might never have, a unifying theory.

In general, nonlinear PDEs are more difficult than linear PDEs, higher-order PDEs are more difficult than lower-order ones, higher-dimensional PDEs are more difficult than lower-dimensional ones, and systems of PDEs are more difficult than single PDEs; we cannot write explicit formulas for the solutions of the majority of PDEs out there, and many PDEs are only satisfied in weak forms. Many PDEs have solutions that develop singularities as time evolves (think of the wave equation and shock waves). Mathematicians who develop PDE theory spend their time proving the existence of solutions of PDEs, and trying to understand the regularity of these solutions, meaning how nice they are in terms of actually possessing the derivatives that are involved in the PDE. This work uses a lot of advanced calculus methods, looking for estimates on integrals (inequalities for upper and lower bounds).

Can AI Step In?

Now, wouldn’t it be great if we had methods that account for variations in the PDE, the geometry of the domain, the boundary conditions, and the parameter ranges, similar to the actual physical problems? Many sectors of the industry and areas of science have their eyes on AI and deep learning to address their long-standing problems or shed new light on them. The past decade’s astronomical advancement in computing solutions of very high-dimensional problems has the potential to transform many fields held down by the curse of dimensionality. Such a transformation would be a sea change for PDEs and in turn for humanity as a whole because of the sheer amount of science that is unlocked by PDEs and their solutions.

For the rest of this chapter, we highlight the hurdles that the differential equations community encounters with the traditional approaches to finding solutions to their PDEs, and with fitting real and noisy data into their models. We then illustrate how machine learning is stepping in to help bypass or alleviate these difficulties. We also consider two questions:

  • What can AI do for PDEs?

  • What can PDEs do for AI?

We need to make sure that the machine learning hallmarks of training function, loss function, and optimization settings are clear when serving PDEs, along with the labels or targets for supervised learning. Fitting the well-established field of PDEs into a machine learning setting is not super straightforward. Ideally, we need to establish a map from a PDE to its solution. This requires some pausing and thinking.

Numerical Solutions Are Very Valuable

Writing a mathematical model that describes a natural phenomenon, in the form of equations describing how the involved variables interact with each other, is only a first step. We need to solve these equations.

Analytical solutions are harder than numerical solutions, since the more the model mimics nature, the more complex the equations tend to be. Even when analytical methods cannot provide formulas for the solutions, they still provide valuable insights into their important properties. Numerical solutions are easier than analytical solutions, because they involve discretizing the continuous equations, moving us from the realm of continuous functions to the realm of discrete numbers, or from infinite dimensional function spaces to finite dimensional vector spaces (linear algebra), which our machines are built to compute. Numerical solutions provide invaluable insights into the true analytical solutions of the models, and are easy to test against experimental observations, when available. They are also easy to tune, so they are great aids for experimental design.

We can devise numerical solutions at any scale, but the curse of dimensionality haunts us when we try to implement and compute our numerical schemes. In many situations, mimicking even one second of the natural evolution of a system in a numerical simulation requires a tremendous amount of computational power, so a lot of dimension reductions and simplifying assumptions must happen, which moves us even farther from having a good approximation of the true solutions. The sad part is that this is the norm rather than the exception.

Continuous Functions Versus Discrete Functions

The function f(x) = x² − 3 is continuous over the whole real line (−∞, ∞). When we discretize it for a numerical scheme for a machine to process, first, the domain cannot be the whole real line anymore, because machines cannot yet conceptualize infinite domains. So our first approximation is slashing the domain dramatically to some finite [−N, N] where N is a large number. Our second approximation is discretizing this finite domain, drastically reducing it one more time, from a continuum [−N, N] to only a finite set of points. If we use many points, then our mesh will be finer and our approximation better, at the expense of increased computation cost. Say we use only six points to discretize the interval [−5, 5]: −5, −3, −1, 1, 3, 5; then our continuous function will be reduced to a vector with only six entries:

f = ( (−5)² − 3, (−3)² − 3, (−1)² − 3, (1)² − 3, (3)² − 3, (5)² − 3 ) = ( 22, 6, −2, −2, 6, 22 )

Figure 13-5 shows the continuous function and its insanely under-representative six point approximation.

Figure 13-5. Discretizing a continuous function into a vector with only six points. We lose all the continuous richness of information between the points.
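The six-entry discretization above takes two lines of NumPy:

```python
import numpy as np

# Discretize f(x) = x^2 - 3 on [-5, 5] using six equally spaced points.
x = np.linspace(-5, 5, 6)   # [-5, -3, -1, 1, 3, 5]
f = x**2 - 3                # [22, 6, -2, -2, 6, 22]
```

Swapping 6 for a larger count refines the mesh, at the cost of longer vectors downstream.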

We Still Need to Discretize Derivatives

We can discretize a function f(x) by selecting points in an interval, just like we did above. Differential equations contain derivatives of functions, such as f_x and Δf, not only the functions themselves. So we must discretize the derivatives, or find some other way to reduce a problem from function spaces (such as continuous spaces) to vector spaces (so we can use linear algebra and compute using our machines). Finite differences and finite elements are two popular discretization methods for differential equations. We will go over them shortly, along with a probabilistic Monte Carlo method based on random walks.

One trade-off of the simplicity of numerical solutions is that when we discretize, we make an approximation, reducing an infinite continuum to a finite set of points, losing all the infinitely detailed information that is between the finite set of points. That is, we sacrifice high resolution. For certain equations, there are analytical methods that help us quantify exactly how much information we lose by discretization, and help move us back to an accurate analytical solution by taking the limit as the size of the discrete mesh goes to zero.

Discretizing continuous functions and the equations that involve them has advantages: easy access. We can teach high school students how to numerically solve the heat equation describing the diffusion of heat in a rod (soon in this chapter), but we cannot teach them how to solve it analytically until they finish their college calculus sequence and linear algebra. This is why we must teach children how to model and compute numerical solutions of real-life problems at a very young age. The simplicity of numerical solutions and the power of computation to aid in solving all kinds of human problems should make this a priority in our education system. I doubt that nature intended for us to build and unravel crazily complicated mathematical theories before computing how the world around us works. I also doubt that nature is as complicated as some mathematical theories happen to be (even though they are still interesting in their own right, if only as an exercise in how far the rules of logic and inference can lead us).

PDE Themes from My Ph.D. Thesis

The story of my Ph.D. thesis demonstrates the drastic difference between mathematical theory and numerical approaches. It is also a good prototype for some of the themes of this chapter. For my Ph.D., I worked on a mathematical model that describes the way atoms diffuse and hop between different levels of a stair-like surface of a thin crystal. This is useful for the materials science community and for the engineers involved in designing the mini things that go into our electronic devices. As time evolves, the crystal’s shape changes due to the movement of atoms on its surface. Eventually, the crystal relaxes into some stable shape.

Discretize right away and do a computer simulation

The moment I wrote down the equations, I was able to do a computer simulation that showed how the shape of the crystal evolves with time. This is one of the PDEs that I worked on (not that you should care about it or know what the function in it refers to):

u_t ( x , t ) = − u² ( u³ )_xxxx , where x ∈ [ 0 , 1 ] , t ∈ [ 0 , ∞ )

For the trained eye, this is a highly nonlinear fourth order equation. The unknown function u appears both squared and cubed. Its cube appears with four derivatives in space, which we can think of as four degrees removed from the function that we want to evaluate. Figure 13-6 shows my PDE’s discretization in space using finite differences (we will discuss finite differences shortly) and its boundary conditions (the function values at points 0 and 1).

Figure 13-6. The discretized differential equation and its continuum analog

Usually, the more nonlinear an equation is, the more stubborn it is in submitting to standard analytical techniques. I still needed to do the mathematical analysis and prove that the shape that the numerical simulation showed is indeed what the equations want the solution to do, meaning that it is the analytical solution, and it is what nature chooses among possible others. I had to spend the next two years of my life doing only that. What I came up with was a tiny proof in a tiny case for a physically unrealistic one-dimensional crystal! I had to reduce my equations to only one dimension to be able to do any mathematical analysis with them.

The curse of dimensionality

One theme that is always present is the dimensionality of the problem. Even when I did the numerical simulation, which took less than one afternoon, I was only able to do it for the equation over the one-dimensional domain. When I tried to do a simulation to model a realistic thin film laboratory crystal that lies on a flat surface, meaning when I had to discretize a two-dimensional surface instead of a one-dimensional segment, the number of discrete points jumped from 100 on the one-dimensional segment to 100² = 10,000 on the two-dimensional surface. My computer at the time could not numerically solve the exact same equation that took only a few seconds in the one-dimensional case. Granted, I was not sophisticated enough to compute on the university’s server or use parallel computing (I do not know if distributed cloud computing had even been invented back then). Such is life with the curse of dimensionality. The computational expense rises exponentially with the number of dimensions. Now let’s think of the equations whose domains have high dimensions to start with (even before discretization), like the Schrödinger equation for quantum particle systems, the Black-Scholes equation for pricing financial instruments, or the Hamilton-Jacobi-Bellman equation in dynamic programming that models multiplayer games or resource allocation problems. Imagine then the magnitude of the curse of dimensionality.

The geometry of the problem

Another theme that we mentioned earlier but is worth repeating: the shape of the domain matters, for both analysis and numerical computation. In my unrealistic one-dimensional case, I used a segment as the domain for my equation. In the two-dimensional case, I had many more choices: a rectangle (has the advantage of a regular grid), a circle (has the advantage of radial symmetry), or any other realistic irregular shape that doesn’t usually have a name. For analysis, the rectangle and the circle domains are the easiest (not for my particular equations but for other simpler equations like linear equations). For simulations, these are good too. But when the shape of the domain is not regular, which is the case for most realistic things, we need to place more discrete points in the parts where it is irregular if we want to capture the domain faithfully. The curse of dimensionality rears its unwelcome head again: more points mean longer vectors and larger input matrices for computations.

Model things that you care for

To finish my Ph.D. story, I never saw a real thin film crystal like the one I was working on until 10 years after I finished my degree, when my friend showed me a thin film crystal of gold in her lab. In retrospect, maybe I should have started there, by seeing exactly what it is in real life that I was trying to model. My priorities now are aligned differently, and I always start by asking whether I care for what it is I am trying to model, how closely the model I choose to work on mimics reality, and whether thinking about analytical solutions is even worth the time and the effort for this particular application.

Discretization and the Curse of Dimensionality

Mathematicians who study PDEs like the continuous world, but machines like the discrete world. Mathematicians like to analyze functions, but machines like to compute functions. To reconcile the two, and for machines to be of aid to mathematicians and vice versa, we can discretize our continuum equations. How? First, we discretize the domain of the equation, creating a discrete mesh. We choose the type of mesh (regular or irregular) and how fine or coarse it is. Then we discretize the differential equation itself, using one of four popular methods:

Finite differences

Deterministic, good for discretizing time, one-dimensional or relatively regular spatial geometries.

Finite elements

Deterministic, good for discretizing more complex spatial geometries, also spatial geometries that vary with time.

Variational or energy methods

This is similar to finite elements but works on a narrower set of PDEs. They should possess a variational principle, or an energy formulation; that is, the PDE itself should be equivalent to E′(u) = 0 for some energy functional E(u) (mapping functions to the real line). The reason I was able to get my Ph.D. is that I discovered such an energy functional for my PDE by pure luck. Just like the minimum of a calculus function happens at points where f′(x) = 0, the minimum of an energy functional happens at functions where E′(u) = 0, but of course we need to define what it means to take the derivative of a functional.

Monte Carlo methods

Probabilistic, starts with discretizing the PDE, then uses that to devise an appropriate random walk scheme that enables us to aggregate the solution at a certain point in the domain.
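To make the random-walk idea concrete, here is a minimal sketch for the simplest possible case (the grid size, boundary values, and trial count are all assumptions): the steady-state equation u″ = 0 on [0, 1] with u(0) = −1 and u(1) = 0. A symmetric walk on the grid eventually exits at one of the two boundary nodes, and u at a node is estimated by averaging the boundary value at the exit point over many walks, which converges to the exact linear solution u(x) = x − 1.

```python
import random

def laplace_rw(i, n=7, left=-1.0, right=0.0, trials=20000, seed=0):
    """Estimate u at grid node i for u'' = 0 on [0, 1] with u(0)=left, u(1)=right.

    A symmetric random walk on the nodes 0..n exits at node 0 with
    probability 1 - i/n and at node n with probability i/n, so the average
    boundary value at exit converges to the exact linear solution
    u(x_i) = left + (right - left) * i / n.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        k = i
        while 0 < k < n:                       # walk until hitting a boundary node
            k += 1 if rng.random() < 0.5 else -1
        total += left if k == 0 else right     # record the boundary value at exit
    return total / trials
```

For example, `laplace_rw(3)` estimates u(3/7), whose exact value is 3/7 − 1 ≈ −0.571; for general PDEs the walk and the averaging rule are derived from the discretized equation rather than from symmetry alone.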

The word finite in these methods stresses the fact that this process moves us from a continuum of infinite dimensional spaces of functions to finite dimensional spaces of vectors.

If the mesh we use for discretization is too fine, it captures more resolution, but we end up with high-dimensional vectors and matrices. Keep this curse of dimensionality in mind, and the following: one of the main reasons neural networks’ popularity skyrocketed is that they seem to have a magical ability to overcome the curse of dimensionality. We will soon see how.

Finite Differences

We use finite differences to numerically approximate the derivatives of the functions that appear in PDEs. For example, a particle’s velocity is the derivative in time of its position vector, and a particle’s acceleration is two derivatives in time of its position vector.

In finite difference approximations, we replace the derivatives with linear combinations of function values at discrete points in the domain. Recall that one derivative measures a function’s rate of change. Two derivatives measure concavity. Higher derivatives measure more stuff that some people in the sciences happen to use. The connection between a function’s derivatives at a point and how its values near that point compare to each other is pretty intuitive.

The mathematical justification for these approximations relies on Taylor’s theorem from calculus:

f(x) = f(x_i) + f′(x_i)(x − x_i) + f″(x_i)/2! (x − x_i)² + f^(3)(x_i)/3! (x − x_i)³ + ⋯ + f^(n)(x_i)/n! (x − x_i)^n + error term

where the error term depends on how nice the next order derivative f^(n+1)(ξ) is near the x_i where we are attempting to use our polynomial approximation. Taylor’s theorem approximates a nice enough function near a point with a polynomial whose coefficients are determined by the derivatives of the function at that point. The more derivatives the function has at a point, the nicer it is and the more like a polynomial it behaves near that point.

Now let’s discretize a one-dimensional interval [a,b], then write down finite difference approximations for the derivatives of a function f(x) defined over this interval. We can discretize [a,b] using n + 1 equally spaced points, so the mesh size is h = (b − a)/n. We can now evaluate f at any of these discrete points. If we care about the values near some point x_i, we define f_{i+1} = f(x_i + h), f_{i+2} = f(x_i + 2h), f_{i−1} = f(x_i − h), etc. In the following, h is small, so an O(h²) method (or higher order in h) is more accurate than an O(h) method:

  1. Forward difference approximation of O(h) accuracy for the first derivative (uses two points):

    f′(x_i) ≈ (f_{i+1} − f_i)/h
  2. Backward difference approximation of O(h) accuracy for the first derivative (uses two points):

    f′(x_i) ≈ (f_i − f_{i−1})/h
  3. Central-difference approximations of O(h²) accuracy for derivatives up to the fourth (for the first derivative, this uses two points and averages the forward and backward differences):

    f′(x_i) ≈ (f_{i+1} − f_{i−1})/(2h)
    f″(x_i) ≈ (f_{i+1} − 2f_i + f_{i−1})/h²
    f‴(x_i) ≈ (f_{i+2} − 2f_{i+1} + 2f_{i−1} − f_{i−2})/(2h³)
    f^(4)(x_i) ≈ (f_{i+2} − 4f_{i+1} + 6f_i − 4f_{i−1} + f_{i−2})/h⁴
  4. Central-difference approximations of O(h⁴) accuracy for derivatives up to the fourth:

    f′(x_i) ≈ (−f_{i+2} + 8f_{i+1} − 8f_{i−1} + f_{i−2})/(12h)
    f″(x_i) ≈ (−f_{i+2} + 16f_{i+1} − 30f_i + 16f_{i−1} − f_{i−2})/(12h²)
    f‴(x_i) ≈ (−f_{i+3} + 8f_{i+2} − 13f_{i+1} + 13f_{i−1} − 8f_{i−2} + f_{i−3})/(8h³)
    f^(4)(x_i) ≈ (−f_{i+3} + 12f_{i+2} − 39f_{i+1} + 56f_i − 39f_{i−1} + 12f_{i−2} − f_{i−3})/(6h⁴)

What does the O ( h k ) mean? It is the order of the numerical approximation in h. When we replace a derivative using its numerical approximation, we commit an error. The O ( h k ) tells us how much error we are committing. Obviously this depends on the size of the mesh h. The error should be smaller with finer meshes. To derive such error bounds, we use Taylor expansions of f(x + h), f(x − h), f(x + 2h), f(x − 2h), etc. and linear combinations of those to determine both the desired derivative’s approximation and the order of our finite difference approximation in terms of h. To be able to use Taylor expansions, we assume that we are dealing with functions that indeed have the required number of derivatives at the points where we are evaluating them. This means that we assume that our function is nice enough to allow these derivative evaluations. If the function has singularities near these points, then we need to find ways around that, such as using much finer meshes near the singularities.
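The order is also easy to observe empirically (the test function f = sin and the mesh sizes are assumptions): for an O(h²) scheme, halving h should cut the error by a factor of about 2² = 4.

```python
import math

def central_err(h, x=1.0):
    # Error of the central difference for f = sin at x, against the exact cos(x).
    approx = (math.sin(x + h) - math.sin(x - h)) / (2 * h)
    return abs(approx - math.cos(x))

ratio = central_err(0.1) / central_err(0.05)   # expect ~4 for an O(h^2) scheme
```

Observing the ratio 2^k under mesh halving is a standard way to confirm a scheme’s order k in practice.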

Example: Solve y″(x) = 1 on [0,1], with boundary conditions y(0) = −1 and y(1) = 0

This is a second-order linear ordinary differential equation on a bounded domain in one dimension. This example is trivial because the analytical solution is so easy. All we have to do is integrate the equation twice and recover the function without its derivatives: y(x) = 0.5x² + c_1 x + c_2, where the c’s are the constants of integration. We plug in the two boundary conditions to find the c’s and obtain the analytical solution y(x) = 0.5x² + 0.5x − 1. However, the point of this example is to show how to use finite differences to compute the numerical solution, not the analytical one, since analytical solutions are not available for many other differential equations, so we might as well get good at this. We first discretize the domain [0,1]. We can use as many points as we want. The more points, the higher the dimension we have to deal with, but the better the resolution. We’ll use only eight points, so the mesh size is h = 1/7 (Figure 13-7). Our continuum [0,1] interval is now reduced to the eight points (0, 1/7, 2/7, 3/7, 4/7, 5/7, 6/7, 1).

Figure 13-7. Discretizing the unit interval using eight discrete points, which is the same as seven subintervals. The step size (or mesh size) is h = 1/7.

Next, we discretize the differential equation. We can use any finite difference scheme to discretize the second derivative. Let’s choose the O(h^2) central difference, so the discretized differential equation becomes:

\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = 1 \quad \text{for } i = 1, 2, 3, 4, 5, 6

Note that the differential equation is only valid in the interior of the domain, which is why we do not include the points i = 0 and i = 7 when we write its discrete analog. We get the values at i = 0 and i = 7 from the boundary conditions: y_0 = –1 and y_7 = 0. Now we have a system of six equations and six unknowns, y_1, y_2, y_3, y_4, y_5, y_6:

\begin{aligned}
y_2 - 2y_1 - 1 &= 1/49 \\
y_3 - 2y_2 + y_1 &= 1/49 \\
y_4 - 2y_3 + y_2 &= 1/49 \\
y_5 - 2y_4 + y_3 &= 1/49 \\
y_6 - 2y_5 + y_4 &= 1/49 \\
0 - 2y_6 + y_5 &= 1/49
\end{aligned}

So now we’ve moved from the continuum world to the linear algebra world:

\begin{pmatrix}
-2 & 1 & 0 & 0 & 0 & 0 \\
1 & -2 & 1 & 0 & 0 & 0 \\
0 & 1 & -2 & 1 & 0 & 0 \\
0 & 0 & 1 & -2 & 1 & 0 \\
0 & 0 & 0 & 1 & -2 & 1 \\
0 & 0 & 0 & 0 & 1 & -2
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix}
=
\begin{pmatrix} 1/49 + 1 \\ 1/49 \\ 1/49 \\ 1/49 \\ 1/49 \\ 1/49 \end{pmatrix}

Solving this system amounts to inverting that tridiagonal matrix, which is the discrete analog of our second-order derivative operator. In the continuum world we integrate the differential operator to recover y(x), and in the discrete world we invert the discrete operator to recover the discrete values y_i. Keep the curse of dimensionality in mind when using more points to discretize the domain.

Obviously we must compare the discrete values y_i with their exact counterparts y(x_i) to see how well our finite difference scheme with only eight discrete points performed (Figure 13-8). Figure 13-9 shows the graph of the numerical solution (using only four discrete points) against the exact analytical solution.
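The whole computation above fits in a few lines of NumPy (a sketch of the procedure described in the text, not the book’s own code; variable names are my own). It builds the tridiagonal system, solves it, and compares against the analytical solution. For this particular equation the central difference is exact up to machine precision, because its truncation error involves the fourth derivative of y, which is zero for a quadratic solution:

```python
import numpy as np

# Discretize [0, 1] with 8 points, so the mesh size is h = 1/7.
h = 1.0 / 7.0
x = np.linspace(0.0, 1.0, 8)

# Tridiagonal matrix: the discrete analog of the second-derivative operator.
A = (np.diag(-2.0 * np.ones(6))
     + np.diag(np.ones(5), 1)
     + np.diag(np.ones(5), -1))

# Right-hand side h^2 * 1 for each interior equation; the boundary value
# y(0) = -1 moves over to the first equation (y(1) = 0 contributes nothing).
b = h**2 * np.ones(6)
b[0] += 1.0

y_interior = np.linalg.solve(A, b)
y = np.concatenate(([-1.0], y_interior, [0.0]))

# Compare with the analytical solution y(x) = 0.5 x^2 + 0.5 x - 1.
exact = 0.5 * x**2 + 0.5 * x - 1.0
max_error = np.max(np.abs(y - exact))
print(max_error)
```

With more points the same three lines that build A and b simply grow; the structure of the tridiagonal operator does not change.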

Figure 13-8. Comparing the numerical solution at each discrete point with the exact analytical solution
Figure 13-9. The graph of the numerical solution (using only four discrete points) against the exact analytical solution (solid line)

Now we can use finite differences to discretize any differential equation of any order or type, on a domain in any dimension. All we have to do is discretize the domain and decide on finite difference schemes to approximate the derivatives at all the discrete points in the interior of the domain.

Example: Discretize the one-dimensional heat equation u_t = α u_xx in the interior of the interval x ∈ (0,1)

This is a second-order linear partial differential equation on a bounded spatial domain in one dimension. Here, u = u(x,t) is a function of two variables, so our discretization scheme should address both coordinates. We can discretize only in space and keep time continuous, only in time and keep space continuous, or both in space and time. It is common to have more than one numerical route. Options are good. If we discretize in both space and time, then we end up with a system of algebraic equations. If we discretize only in space and not in time, then we end up with a system of ordinary differential equations. Since the PDE is linear, the discretized system is linear as well.

Let’s write down a full discrete scheme. To discretize in space, let’s use a second-order centered difference to approximate the second derivative. And to discretize in time, let’s use a forward difference to approximate the first derivative:

\frac{u_{i,j+1} - u_{i,j}}{\Delta t} = \alpha \, \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} \quad \text{for } i = 1, 2, \ldots, n, \; j = 0, 1, 2, \ldots

In such equations, u(x,t) is known at some initial time (u(x,0) = g(x)), and we want to know how u(x,t) evolves in time. In the numerical scheme, the subscript i stands for discrete space, and j stands for discrete time. Thus, the discrete analog of the initial condition is u_{i,0} = g_i, and we want to solve for the unknowns u_{i,j+1} for i = 1, 2, …, n, and j = 1, 2, …. We can easily isolate u_{i,j+1} in the previous numerical scheme:

u_{i,j+1} = \frac{\alpha \Delta t}{h^2} \left( u_{i+1,j} - 2u_{i,j} + u_{i-1,j} \right) + u_{i,j} \quad \text{for } i = 1, 2, \ldots, n, \; j = 0, 1, 2, \ldots

Finally, we use this to find the values of u_{i,1}, u_{i,2}, … (forward in time) for i = 1, 2, …, n. For example, we plug in j = 0 to find the values of discrete u at the first time step:

u_{i,1} = \frac{\alpha \Delta t}{h^2} \left( u_{i+1,0} - 2u_{i,0} + u_{i-1,0} \right) + u_{i,0} = \frac{\alpha \Delta t}{h^2} \left( g_{i+1} - 2g_i + g_{i-1} \right) + g_i \quad \text{for } i = 1, 2, \ldots, n

Note that we know all the discrete values of g, so we now know the discrete values u_{i,1} as well. Next, we plug in j = 1 to find the discrete values u_{i,2} at the next time step, and so on.
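This forward-in-time marching can be sketched in a few lines (my own toy setup, not from the book: a hypothetical initial profile g(x) = sin(πx) with zero boundary values, for which the exact solution e^{-απ²t} sin(πx) is available for comparison). The time step is chosen to respect the classic stability bound αΔt/h² ≤ 1/2 for this explicit scheme:

```python
import numpy as np

alpha = 1.0
n = 50                          # number of spatial intervals on [0, 1]
h = 1.0 / n
dt = 0.4 * h**2 / alpha         # respects the stability bound alpha*dt/h^2 <= 1/2
r = alpha * dt / h**2

x = np.linspace(0.0, 1.0, n + 1)
u = np.sin(np.pi * x)           # hypothetical initial condition g(x) = sin(pi x)

t = 0.0
while t < 0.1:
    # u_{i,j+1} = (alpha*dt/h^2)(u_{i+1,j} - 2u_{i,j} + u_{i-1,j}) + u_{i,j};
    # the boundary entries u[0] and u[n] stay at zero.
    u[1:-1] += r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    t += dt

# For this initial condition the exact solution is exp(-alpha*pi^2*t)*sin(pi*x).
exact = np.exp(-alpha * np.pi**2 * t) * np.sin(np.pi * x)
max_error = np.max(np.abs(u - exact))
print(max_error)
```

Note that each time step is just a vectorized application of the stencil; no linear system needs to be solved because the scheme is explicit.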

Finite Elements

Finite element methods differ from finite difference methods in that they operate on a weak formulation of the PDE, as opposed to operating directly on the PDE. A weak formulation is weighted and averaged, so we are thinking integrals and integration by parts. We will come back to this shortly.

Before discussing the general idea of finite elements, let’s observe Figure 13-10 a little. This shows a finite element solution of a PDE on top of a circular domain. The discretization of the domain uses a triangular mesh, and the solution seems to be approximated by a piecewise linear function. We can use other polygonal shapes for the meshes, and we can use smoother functions than piecewise linear, such as piecewise quadratic or higher degree polynomials. The trade-off for more smoothness is more computation.

Figure 13-10. A finite element solution of a PDE on a circular domain (image source)

Let’s demonstrate how the finite element method gives a numerical approximation to the solution of the following PDE:

-\Delta u(x,y) = f(x,y) \quad \text{for } (x,y) \in \Omega \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial\Omega

This is a Poisson equation (it appears in electrostatics). There is no time evolution. f(x,y) is specified, and we are looking for an unknown function u(x,y) that is zero at the entire boundary, and whose second derivatives u_xx and u_yy add up to –f(x,y). This PDE is very well studied and we have formulas for its analytical solution, but we are only interested in the numerical approximation of the solution using the finite element method.

For this, we will produce an approximation of the unknown function u(x,y) that lives in an infinite dimensional space using a known function that lives in a finite dimensional space. Finite dimensional spaces are spanned only by finitely many linearly independent functions. We get to choose these basis functions, so we make sure that our choice makes our computations very easy. We usually choose piecewise linear functions or piecewise polynomial functions, each supported minimally on the mesh. This means that the basis function is nonzero only on top of one or two adjacent elements of the mesh, and zero everywhere else. Thus, an integration involving this function on the whole domain of the PDE would reduce to an integration on only one or two elements of the mesh.

After choosing these basis functions, each supported on a few mesh elements, we approximate the true solution u(x,y) by a linear combination of these easy and locally supported basis functions:

u(x,y) \approx u_1 \, basis_1(x,y) + u_2 \, basis_2(x,y) + \cdots + u_n \, basis_n(x,y)

Now we must find the constants u_i of the linear combination. Therefore, we reduce our problem from solving for the unknown function u(x,y) in the continuum to solving for the unknown vector of coefficients (u_1, u_2, …, u_n). We must choose them so that the approximation u_1 basis_1(x,y) + u_2 basis_2(x,y) + ⋯ + u_n basis_n(x,y) satisfies the PDE, in some sense. We have n unknowns, so we must write n equations and solve a system of n equations in n unknowns. We get these from the PDE, or its weak formulation. To get a weak formulation of the PDE, we multiply it by a function v(x,y), integrate over our domain, then use integration by parts to get rid of higher-order derivatives. Remember that the fewer derivatives we have, the closer to the unknown function we get. Let’s do this step by step:

The original PDE is:

-\Delta u(x,y) = f(x,y) \quad \text{for } (x,y) \in \Omega \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial\Omega

Multiply the PDE by a function v(x,y) and integrate over the entire domain. This is a weak formulation of the PDE since it makes it satisfied in an integral form as opposed to a point-by-point form:

-\int_\Omega \Delta u(x,y) \, v(x,y) \, dx \, dy = \int_\Omega f(x,y) \, v(x,y) \, dx \, dy

Note the operator Δ = ∇·∇, the dot product of two derivative operators. Integration by parts helps us get rid of one of the derivatives by moving it over to the other function inside the integral. This doesn’t come for free. In the process, it picks up a negative sign and another integral term that operates on the boundary of the domain. This new boundary integral involves the product of v and the normal derivative of u, so it needs the outward unit normal vector to the boundary, n:

\int_\Omega \nabla u(x,y) \cdot \nabla v(x,y) \, dx \, dy - \int_{\partial\Omega} v(x,y) \, \nabla u(x,y) \cdot \vec{n} \, ds = \int_\Omega f(x,y) \, v(x,y) \, dx \, dy

We can choose v(x,y) = 0 on the boundary and that makes the whole boundary term disappear:

\int_\Omega \nabla u(x,y) \cdot \nabla v(x,y) \, dx \, dy = \int_\Omega f(x,y) \, v(x,y) \, dx \, dy

Now we replace u(x,y) with its finite dimensional approximation:

\int_\Omega \nabla \left( u_1 \, basis_1(x,y) + u_2 \, basis_2(x,y) + \cdots + u_n \, basis_n(x,y) \right) \cdot \nabla v(x,y) \, dx \, dy = \int_\Omega f(x,y) \, v(x,y) \, dx \, dy

which is equivalent to:

u_1 \int_\Omega \nabla basis_1(x,y) \cdot \nabla v(x,y) \, dx \, dy + \cdots + u_n \int_\Omega \nabla basis_n(x,y) \cdot \nabla v(x,y) \, dx \, dy = \int_\Omega f(x,y) \, v(x,y) \, dx \, dy

This is it: We can choose n different functions for v(x,y) to get n different equations in n unknowns (the u i ’s are the unknowns). A common theme is that every time we get to pick, we pick things that do not complicate our computation life. The easiest choices for v(x,y) are the n basis functions that we already have, since these produce many cancellations when integrated against each other (orthogonality), and when integrated against themselves produce the number 1 (normality). The basis functions that we originally choose form an orthonormal set of functions. All for the business of making our life easier. Therefore, the n equations are:

\begin{aligned}
u_1 \int_\Omega \nabla basis_1 \cdot \nabla basis_1 \, dx \, dy + \cdots + u_n \int_\Omega \nabla basis_n \cdot \nabla basis_1 \, dx \, dy &= \int_\Omega f \, basis_1 \, dx \, dy \\
u_1 \int_\Omega \nabla basis_1 \cdot \nabla basis_2 \, dx \, dy + \cdots + u_n \int_\Omega \nabla basis_n \cdot \nabla basis_2 \, dx \, dy &= \int_\Omega f \, basis_2 \, dx \, dy \\
&\;\;\vdots \\
u_1 \int_\Omega \nabla basis_1 \cdot \nabla basis_n \, dx \, dy + \cdots + u_n \int_\Omega \nabla basis_n \cdot \nabla basis_n \, dx \, dy &= \int_\Omega f \, basis_n \, dx \, dy
\end{aligned}

Finally, we solve the system of n equations in n unknowns, which we set in a linear algebra form (where b_i = basis_i):

\begin{pmatrix}
\int_\Omega \nabla b_1 \cdot \nabla b_1 \, dx \, dy & \int_\Omega \nabla b_2 \cdot \nabla b_1 \, dx \, dy & \cdots & \int_\Omega \nabla b_n \cdot \nabla b_1 \, dx \, dy \\
\int_\Omega \nabla b_1 \cdot \nabla b_2 \, dx \, dy & \int_\Omega \nabla b_2 \cdot \nabla b_2 \, dx \, dy & \cdots & \int_\Omega \nabla b_n \cdot \nabla b_2 \, dx \, dy \\
\vdots & \vdots & & \vdots \\
\int_\Omega \nabla b_1 \cdot \nabla b_n \, dx \, dy & \int_\Omega \nabla b_2 \cdot \nabla b_n \, dx \, dy & \cdots & \int_\Omega \nabla b_n \cdot \nabla b_n \, dx \, dy
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
=
\begin{pmatrix} \int_\Omega f \, b_1 \, dx \, dy \\ \int_\Omega f \, b_2 \, dx \, dy \\ \vdots \\ \int_\Omega f \, b_n \, dx \, dy \end{pmatrix}

Recall that we know the function f(x,y), all the basis functions b_i = basis_i, and the domain Ω, so all we have to do is solve the system of equations. This system is sparse because most of these integrals are zero. We chose basis functions with small support for exactly this reason. We never want to solve a dense system of equations.
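To make the recipe concrete without the bookkeeping of 2D triangular meshes, here is a sketch of the same weak-form procedure in one dimension (my own simplification, not the book’s example): solve –u'' = 1 on [0,1] with u(0) = u(1) = 0 using piecewise-linear "hat" basis functions on a uniform mesh. The stiffness matrix of integrals ∫ b_i' b_j' dx is tridiagonal precisely because each hat overlaps only its two neighbors:

```python
import numpy as np

n = 10                          # number of mesh intervals on [0, 1]
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)

# Stiffness matrix K[i, j] = integral of basis_i' * basis_j'.
# Hat functions overlap only their neighbors, so K is sparse (tridiagonal).
K = (np.diag(2.0 * np.ones(n - 1))
     + np.diag(-np.ones(n - 2), 1)
     + np.diag(-np.ones(n - 2), -1)) / h

# Load vector: integral of f * basis_i; with f = 1 each hat integrates to h.
b = h * np.ones(n - 1)

U = np.linalg.solve(K, b)       # the unknown coefficients u_1, ..., u_{n-1}
u = np.concatenate(([0.0], U, [0.0]))

# Analytical solution of -u'' = 1 with u(0) = u(1) = 0.
exact = 0.5 * x * (1.0 - x)
max_error = np.max(np.abs(u - exact))
print(max_error)
```

For this particular problem the hat-function method reproduces the exact solution at the mesh nodes, so the printed error is at machine-precision level; in general the error shrinks as the mesh is refined.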

Of course, we have many questions and a rich literature on finite elements that deals with them:

Do all PDEs have a weak formulation that allows us to do things like this?

Yes, because we can always multiply PDEs with v functions and integrate by parts, but some PDEs have better structures to carry out simplifying computations than others.

What about the energy formulations or variational principles of PDEs, are they related?

Yes, they are related. Look up the Ritz method. We hinted at this in Chapter 10, when we related minimizing energy functionals to solving PDEs. One thing to keep in mind here is that most PDEs have a weak formulation, but not all of them have an energy minimization formulation. One of the reasons I got my Ph.D. was that I discovered an energy formulation for the PDE that I was working with. It was by complete chance. All I did was one lucky weak formulation, followed by an integration by parts. Just like we did in this section. Trial and error are underestimated in this life.

What about Sobolev spaces, why do we study them in advanced courses on PDEs?

Because we need to set our functions u, v, and the basis functions in the appropriate function spaces that tell us that all the computations and approximations that we are using are valid. For example, we don’t want the involved integrals that contain our functions and their derivatives to blow up.

Can we use nonuniform meshes to adjust for the more detailed part of the domain, such as the one in Figure 13-11?

Yes, nothing in our discussion relies strictly on a uniform mesh.

How many basis functions do we need?

As many as our mesh elements.

Under what conditions does the approximate solution converge to the true solution?

Welcome to finite element analysis.

How is this used in applications?

All the time. It started with mechanics and structural designs: loads, stresses, and strains; but now the finite element method is used to numerically solve all kinds of PDEs whose spatial domains have complex geometries.

What could go wrong?

As always, the curse of dimensionality. We need more mesh elements for higher resolution, so the system of equations that we end up having to solve grows exponentially with the number of mesh elements. No bueno. Ideally, we want a mesh that is not detailed in places where it doesn’t need to be, and more detailed in more interesting parts of the domain.

What else could complicate matters?

For PDEs whose domain evolves in time, we need meshes that evolve in time accordingly.

Can AI help learn an appropriate mesh given a certain geometry and PDE?

Yes, we will see this soon in this chapter.

Before moving on, we note that the finite element method is a finite dimensional mesh dependent method that approximates the solution of the PDE. Later in this chapter we will learn about meshless neural network methods.

Figure 13-11. A two-dimensional domain with a nonuniform triangular mesh (image source)

Variational or Energy Methods

Some PDEs are very special in the sense that their solution minimizes an energy functional. We say that such a PDE possesses a variational principle. The Poisson equation, the one we just solved using finite elements, is one of these lucky PDEs. When a PDE possesses a variational principle, it opens for us another route to understanding its solution by studying the energy functional that it happens to minimize.

Let’s write the Poisson equation and the energy functional that its solution minimizes, without going through the details of why this is the case:

\Delta u(x,y) = f(x,y) \quad \text{for } (x,y) \in \Omega \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial\Omega

J(u) = \int_\Omega |\nabla u(x,y)|^2 + 2 f(x,y) \, u(x,y) \, dx \, dy

Now we can exploit this new knowledge to numerically approximate the solution of the PDE: look for an approximate minimizing scheme of the energy functional. Similar to the finite element method, we project our infinite dimensional solution u(x,y) onto a finite dimensional space, where we get to choose the basis elements:

u(x,y) \approx u_1 \, basis_1(x,y) + u_2 \, basis_2(x,y) + \cdots + u_n \, basis_n(x,y)

and we must again solve for the numbers (u_1, u_2, …, u_n). To do this, we plug the approximate u(x,y) into the formula of the energy functional. Since we know all the basis elements, this is now a function of (u_1, u_2, …, u_n), which we can minimize using standard calculus methods. Done!

This method is fairly general and introduces us smoothly into the calculus of variations, which is about finding optima of functionals instead of functions, like in normal calculus.
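A tiny Ritz-style sketch of this idea (a hypothetical one-dimensional analog, entirely my own toy example): for u'' = 1 on [0,1] with zero boundary values, the solution minimizes J(u) = ∫ |u'|² + 2fu dx. Approximating u with the single basis element c·sin(πx) turns J into an ordinary function of the number c, which we can minimize by brute-force scanning. The minimizer should match the first Fourier sine coefficient of the exact solution x(x–1)/2, which is –4/π³:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
f = np.ones_like(x)             # the right-hand side f(x) = 1

def trapezoid(y, xs):
    """Simple trapezoid-rule quadrature."""
    return float(np.sum((y[1:] + y[:-1]) * 0.5 * np.diff(xs)))

def J(c):
    """Energy functional evaluated at the one-term approximation c*sin(pi x)."""
    u = c * np.sin(np.pi * x)
    du = c * np.pi * np.cos(np.pi * x)
    return trapezoid(du**2 + 2.0 * f * u, x)

# Brute-force scan over the single coefficient c.
cs = np.linspace(-0.3, 0.1, 4001)
c_best = min(cs, key=J)

print(c_best, -4.0 / np.pi**3)  # the minimizer should be close to -4/pi^3
```

With more basis elements the scan is replaced by standard calculus: setting the partial derivatives of J with respect to each coefficient to zero yields a linear system, just as in the finite element section.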

Monte Carlo Methods

We are now used to switching our brain to probabilistic thinking to solve deterministic problems. We did that with the stochastic gradient descent for minimizing loss functions, multiplication of large matrices, randomized singular value decomposition of a large matrix, and random walks on graphs to identify communities, rank web pages, and other purposes. The most famous introductory examples to Monte Carlo methods are:

  • Estimating π by generating many random points (x_random, y_random) in a unit square, and finding the proportion that lies inside the inscribed quarter circle of radius 1: x_random^2 + y_random^2 ≤ 1. Now we can estimate the probability of landing in the quarter circle:

    Prob(\text{point inside the quarter circle}) = \frac{\text{area of quarter circle of radius 1}}{\text{area of unit square}} = \frac{\pi}{4} \approx \frac{\text{number of times the point is inside the quarter circle}}{\text{total number of points generated}}
  • Estimating the integral \int_a^b f(x) \, dx of a nonnegative and continuous function f(x) over an interval [a,b] by generating random points (x_random, y_random), where a ≤ x_random ≤ b and 0 ≤ y_random ≤ max(f). The value of the integral is the area under the graph of f. We can estimate it by finding the proportion of times that the random point lies under the graph of f(x), or y_random ≤ f(x_random):

    Prob(\text{point under the graph of } f) = \frac{\text{area under the graph of } f}{\text{total area of the rectangle}} = \frac{\int_a^b f(x) \, dx}{(b-a) \times \max(f)} \approx \frac{\text{number of times the point is under the graph of } f}{\text{total number of points generated}}

Such stochastic methods to solve deterministic problems are called Monte Carlo methods because they involve repetitive games of chance, and counting proportions of certain outcomes, just like gambling in Monte Carlo casinos in Monaco. They could have been called Las Vegas Strip methods as well. This is analogous to randomized controlled trials to answer deterministic questions, for example, to assess the effect of a certain drug on a given population. Another way to answer the same question would be a completely deterministic observational study, where one controls for all suspected confounding variables, and assesses the effects of the drug intervention.

Now suppose we have a deterministic PDE and we want to find its solution using randomized numerical trials (Monte Carlo). To illustrate how this works, let’s use a simple PDE:

\Delta u(x,y) = 0 \quad \text{for } (x,y) \in \text{unit square} \subset \mathbb{R}^2, \qquad u(x,y) = g(x,y) \quad \text{for } (x,y) \in \partial(\text{square})

Let’s first discretize the domain using a uniform grid, then write a finite difference scheme for the PDE at the interior grid points and for the boundary conditions:

\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h^2} = 0 \quad \text{when } (i,j) \text{ corresponds to an interior grid point}, \qquad u_{i',j'} = g_{i',j'} \quad \text{when } (i',j') \text{ corresponds to a boundary point}

The goal is to use the numerical scheme to find u_{i,j} for each interior point of the grid. This will be the numerical estimate of the true solution u(x,y) at that particular interior point. Let’s solve for u_{i,j}:

u_{i,j} = \frac{1}{4} u_{i+1,j} + \frac{1}{4} u_{i-1,j} + \frac{1}{4} u_{i,j+1} + \frac{1}{4} u_{i,j-1} \quad \text{when } (i,j) \text{ corresponds to an interior grid point}, \qquad u_{i',j'} = g_{i',j'} \quad \text{when } (i',j') \text{ corresponds to a boundary point}

This is how we interpret the equations for a random walk setting: if we are at the boundary, then we know the solution; it is u_{i',j'} = g_{i',j'}. So a random walker who is following the guidance of the PDE scheme to structure their walk would collect their reward g_{i',j'} at the boundary point. Moreover, the solution u_{i,j} at an interior grid point (i,j) is the unweighted average of the solution at the four surrounding grid points. So if a random walker starts at an interior grid point (i,j), we will give them a 0.25 probability of wandering off to any of their four neighboring points, then to their neighboring points, until the walker hits a boundary grid point (i',j'), where they collect their reward g_{i',j'}. That would be only one exploration of the PDE scheme, in a way getting us a tiny piece of information on which boundary point contributed to the solution u_{i,j}. If we repeat this process many times, say a thousand times, all starting from the same grid point (i,j) where we want to find the numerical solution, then we can count the proportion of times the random walker ended up at each of the boundary points:

Prob(\text{ending up at boundary point } (i',j')) \approx \frac{\text{number of random walkers that ended up at } (i',j')}{\text{total number of random walks started at } (i,j)}

This will give us an estimate of the expected rewards from all the boundary points, which is exactly the numerical solution that we are looking for: how does each boundary value play a role in the solution at the interior point? So the numerical solution of the PDE is:

u_{i,j} = \sum_{(i',j')} Prob(\text{ending up at boundary point } (i',j')) \; g_{i',j'}

This is such a neat way of getting a numerical solution that does not involve solving a linear system of equations (which could be very large and undesirable). It is also excellent when we care about finding the solution at a few points only, as opposed to finding the solution at the entire grid.
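Here is a sketch of the walker-based solver (my own toy setup, not from the book: a 10×10 grid on the unit square with boundary data g sampled from the harmonic function u(x,y) = xy, so the exact value at the center point is 0.25):

```python
import random

random.seed(0)
n = 10                                   # (n+1) x (n+1) grid on the unit square

def g(i, j):
    # Boundary data sampled from the harmonic function u(x, y) = x * y
    # (a hypothetical choice), so the exact solution at (i, j) is (i/n)*(j/n).
    return (i / n) * (j / n)

def solve_at(i, j, walks=20_000):
    """Monte Carlo estimate of u at the interior grid point (i, j)."""
    total = 0.0
    for _ in range(walks):
        wi, wj = i, j
        # Step to one of the 4 neighbors with probability 1/4 each,
        # until the walker hits the boundary and collects the reward g.
        while 0 < wi < n and 0 < wj < n:
            di, dj = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            wi, wj = wi + di, wj + dj
        total += g(wi, wj)
    return total / walks

estimate = solve_at(5, 5)                # exact value there is 0.5 * 0.5 = 0.25
print(estimate)
```

Note that no linear system appears anywhere, and that `solve_at` is called only for the one point we care about.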

Of course, for each PDE we have to devise the correct numerical scheme along with the transition probabilities of the random walker. For example, if the PDE had coefficients multiplied with the second derivatives, then the random walker would not wander off to each of the four neighboring points with equal probability of 0.25. The coefficients would introduce weights for each neighbor, so we adjust the transition probabilities to each.

In terms of theory, we have to prove that a walker does eventually hit the boundary, and that the numerical solution obtained this way does converge to the true analytical solution of the PDE. We also have to obtain analytical estimates on how long until a random walk stops (on average), how fast the numerical solution converges, and how a numerical solution obtained this way fares against the ones obtained from finite differences or finite elements in terms of accuracy, computation cost, and speed of convergence.

With Monte Carlo methods, sometimes we start the other way around. We devise a simulation involving the different processes and transition rates mimicking some physical phenomena (such as interacting particles of a system), then we average that and transition to writing PDEs involving the descriptors of the system at hand. This is the exact opposite of starting with a PDE, then devising a granular scale Monte Carlo simulation to solve it. We discuss this next.

Some Statistical Mechanics: The Wonderful Master Equation

One of my favorite PDEs is the master equation from statistical mechanics, because it is one of the few PDEs out there that is able to get us from characterizing a system at an atomistic or molecular scale (system of particles), where the description is probabilistic, to the same system at a macro scale, where the description is deterministic. It is only logical to expect that the underlying atomistic processes and transitions give rise to an observed behavior at the macro scale. The very wise person who introduced me to statistical mechanics told me: “Isn’t our whole lived experience the result of the collective behavior of some massive underlying chemical reaction?”

The transition from the master equation for atomistic probabilities to deterministic PDEs for observed quantities is neat and doesn’t feel like we cheated or made fuzzy assumptions, or that we have two completely disconnected models, one at a macro scale and another at an atomistic scale, that have nothing to do with each other.

The master equation tracks the evolution of the probability of a statistical system (some particles) being in some state at a certain time. We calculate the rate of change of the probability of the state of the system by subtracting the losses from the gains and accounting for transition rates between different states:

\frac{\partial P(h,t)}{\partial t} = \sum_{h'} \left[ P(h',t) \, T(h' \to h) - P(h,t) \, T(h \to h') \right] := \mathcal{L} P

where T(h → h') and T(h' → h) are the transition rates from state h to h', and vice versa. We calculate these transition rates using the underlying physical assumptions or observations on the system, for example, evaporation and condensation rates of atoms, diffusion rates, etc.

Now we can employ the master equation to write partial differential equations for the deterministic descriptors of the system by computing their expectations. Expectations transition us from probabilistic quantities to deterministic ones. Here’s how we calculate it:

\langle f \rangle = \sum_h f(h) \, P(h,t) = \sum_h f(h) \, \frac{e^{-H(h)/kT}}{Z}

where H(h) is the total energy, and Z is the partition function. The expression e^{-H(h)/kT}/Z is very common in statistical mechanics and expresses the intuitive idea that states with high energy are exponentially less likely to occur, meaning that systems prefer low energy and evolve toward the states that lower the total energy.

Expectation Versus Averaging over N Repetitions of a Monte Carlo Simulation

This expectation ⟨f⟩ is related to Monte Carlo simulations. It is equivalent to the limit as N → ∞ of the mean f̄ of f after N repetitions of the Monte Carlo simulations.
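A quick numerical illustration of this equivalence (a hypothetical four-state system with energies H(h) = h and kT = 1, entirely my own toy example): the Boltzmann expectation ⟨f⟩ computed from the partition function is approximated by the empirical mean over N sampled states:

```python
import math
import random

random.seed(0)

# Hypothetical toy system: states h in {0, 1, 2, 3} with energies H(h) = h, kT = 1.
states = [0, 1, 2, 3]
kT = 1.0
weights = [math.exp(-h / kT) for h in states]
Z = sum(weights)                          # the partition function
probs = [w / Z for w in weights]

# Exact expectation of f(h) = h under the Boltzmann distribution e^{-H(h)/kT}/Z.
exact = sum(h * p for h, p in zip(states, probs))

# Mean of f over N Monte Carlo repetitions approximates the expectation.
N = 100_000
samples = random.choices(states, weights=probs, k=N)
mc_mean = sum(samples) / N
print(exact, mc_mean)
```

As N grows, the empirical mean concentrates around the exact expectation, which is the sense in which ⟨f⟩ equals the large-N limit of f̄.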

We can now compute the rate of change of the expectation of the quantity of interest, say h_i representing the height of a crystal profile (made up of atoms) at a certain site i, using the master equation:

\frac{d \langle h_i \rangle}{dt} = \sum_h h_i \frac{\partial P(h,t)}{\partial t} = \sum_h h_i \, \mathcal{L} P

If the system is closed, meaning if we are able to express the righthand side in terms of h and its derivatives with respect to space and time, then we obtain an equation of motion for the expected height profile. If the system is not closed, then we have to make an approximation to close the system. We better make a physically plausible approximation, such as the system is near equilibrium, otherwise our efforts will be useless.

The final step is coarse-graining the resulting discrete equations of motion to obtain a continuum PDE model describing the crystal profile. This step moves us from a finite difference scheme to a continuum PDE, which is the reverse of the discretizing process we learned with finite differences. Using this process, the resulting PDE emerges directly from atomistic processes. Such a PDE usually looks like:

$$h_t(x,t) = F(h(x,t), t; \omega)$$

where ω is the set of the system’s physical parameters.

Soon in this chapter, we will learn about using graph neural networks to simulate natural phenomena at the macroscopic scale directly from particle systems. This bypasses writing down PDEs as we did in this section. The inputs to the network will be the particles along with their interactions and rates of interaction, and the output will be the time evolution (a video, or a time sequence of graphs) of the system as a whole.

Solutions as Expectations of Underlying Random Processes

For some types of PDEs, a neat way to find solutions is to formulate them as expectations of some underlying random processes: we simulate random paths of an appropriate stochastic process, then compute the expectation. This allows us to evaluate solutions at any given space-time locations.

To learn how to do this, we need to study the Feynman-Kac formula and Itô’s calculus (which helps us find derivatives of functions of time-dependent random variables). These tie PDEs and probability nicely together.

The Feynman-Kac formula (which we will not write) offers a practical way to solve some PDEs that have been haunted by the curse of dimensionality. For example, in quantitative finance, we can use the Feynman-Kac formula to efficiently calculate solutions to the Black-Scholes equation to price options on stocks. In quantum chemistry, we can use it to solve the Schrödinger equation.
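For the heat equation $u_t = \alpha u_{xx}$, for instance, the Feynman-Kac representation reduces to a particularly simple form: the solution at (x, t) is the expectation of the initial condition evaluated at the endpoint of a Brownian path started at x. A sketch (the initial condition $u_0(x) = \sin(x)$ is our choice for testing, since its analytic solution $e^{-\alpha t}\sin(x)$ is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# u_t = alpha * u_xx on the line, with u(x, 0) = sin(x) (our test case, since
# the analytic solution u(x, t) = exp(-alpha*t) * sin(x) is known).
# For pure diffusion, the Feynman-Kac representation reduces to:
#     u(x, t) = E[ u0(x + sqrt(2*alpha*t) * Z) ],   Z ~ N(0, 1).
alpha, x, t, N = 0.5, 1.0, 1.0, 1_000_000

endpoints = x + np.sqrt(2 * alpha * t) * rng.standard_normal(N)
u_mc = np.sin(endpoints).mean()           # expectation estimated by averaging
u_exact = np.exp(-alpha * t) * np.sin(x)  # analytic solution at (x, t)

print(u_mc, u_exact)  # the Monte Carlo estimate matches to a few decimals
```

Note that this evaluates the solution at a single space-time location without discretizing the whole domain, which is why such representations sidestep the curse of dimensionality.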

Transforming the PDE

The idea here is simple: maybe the PDE in a transformed space is easier to solve (analytically or numerically) than in the space it currently lives. So we transform it in some way and wish for the best.

Fourier Transform

The Fourier transform is an integral transform from x space to frequency ξ space:

$$F.T.(f(x)) = \hat{f}(\xi) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-i \xi x} f(x) \, dx$$

The inverse Fourier transform undoes the Fourier transform and brings us back from frequency ξ space to x space:

$$F.T.^{-1}(\hat{f}(\xi)) = f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{i \xi x} \hat{f}(\xi) \, d\xi$$

There are tables with the Fourier transforms of many functions for our convenience. When these are not available, we resort to numerical methods. Because the Fourier transform and its inverse are important for many applications, such as frequency analysis, signal modulation, and filtering, there are algorithms that have been specifically developed for its fast computation.

The following are some powerful things that we need to know about the Fourier transform:

  • It strips a function down to its frequency components. The Fourier transform of a function tells us how much of each frequency a function has. The frequency spectrum of a function f(x) is the absolute value of its Fourier transform: | F ( ξ ) | .

  • It has an inverse transform that allows us to move back and forth between x space and frequency space ξ .

  • It changes the convolution of two functions in x space to multiplication of functions in frequency ξ space: F . T . ( f * g ( x ) ) = F . T . ( f ( x ) ) × F . T . ( g ( x ) ) = f ^ ( ξ ) g ^ ( ξ ) . This is helpful when trying to find analytical solutions for PDEs. Solving a PDE in x space boils down to solving an algebraic equation or an easier differential equation in ξ space, then using the inverse Fourier transform to get back to x space. Many times we are inverting the product of two Fourier transforms, so the solution ends up being a convolution in x space. If you have encountered Green’s functions for analytical solutions before, this is one way to arrive at them.

  • It changes differentiation in x to multiplication by i ξ , so:

    F . T . ( u x ( x ) ) = i ξ F . T . ( u ( x ) ) = i ξ u ^ ( ξ )

    and

    F . T . ( u xx ( x ) ) = - ξ 2 F . T . ( u ( x ) ) = - ξ 2 u ^ ( ξ )

    Getting rid of derivatives is huge. It means that differential equations in original space become algebraic equations in Fourier space.

  • It is a linear transformation, so we can apply it separately to each term in a PDE. It enables us to solve linear PDEs with constant coefficients seamlessly. For linear PDEs with nonconstant coefficients (where the parameters depend on space), we can still use the Fourier transform if we are willing to get messy with series expansions of these coefficients. The moment we write series, we have to investigate their convergence.

  • We use it to prove the universal approximation theorem for neural networks (Hornik et al. 1989).

  • We can use it to speed up convolutional neural networks (Mathieu et al. 2013).

  • It turns out that representing PDEs in Fourier space is convenient if we want to train neural networks to learn PDE solutions.
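Two of these properties, the differentiation rule and the convolution theorem, can be checked numerically with the discrete Fourier transform (NumPy's FFT on a periodic grid stands in for the continuous transform here; conventions and constants differ, but the structure is the same):

```python
import numpy as np

n = 256
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
xi = np.fft.fftfreq(n, d=x[1] - x[0]) * 2 * np.pi  # frequencies are integers here

u = np.exp(np.sin(x))  # a smooth periodic test function

# Differentiation in x becomes multiplication by i*xi in frequency space:
u_x_spectral = np.fft.ifft(1j * xi * np.fft.fft(u)).real
u_x_exact = np.cos(x) * np.exp(np.sin(x))
print(np.max(np.abs(u_x_spectral - u_x_exact)))  # tiny: spectral accuracy

# Circular convolution in x space becomes pointwise multiplication in xi space:
v = np.cos(x)
conv_direct = np.array([np.sum(u * np.roll(v[::-1], k + 1)) for k in range(n)])
conv_fft = np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)).real
print(np.max(np.abs(conv_direct - conv_fft)))  # tiny: the two sides agree
```

Both printed errors are near machine precision, which is exactly why spectral methods evaluate derivatives and convolutions in frequency space.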

There are also some not so convenient things:

  • Many functions have complex valued Fourier transforms. We just learn complex analysis and live with that.

  • Not all functions have a Fourier transform. The involved integral operates on an infinite domain, so if there is no function inside the integral that compensates with a rapid decay to zero, the integral blows up (rendering the Fourier transform useless). The kernel of the Fourier transform is e -iξx = cos ( ξ x ) - i sin ( ξ x ) . This oscillates with frequency ξ and never decays to zero.

  • Even for functions whose inverse Fourier transform (which helps us find analytic solutions for PDEs) exists, many times we do not know its formula. In these cases, we would not be able to write an explicit analytic solution using this method. This is a common problem for many analytic methods.

  • Fourier transforms have the Heisenberg uncertainty principle.

    The study of uncertainty principles began with Werner Heisenberg’s argument that it is impossible to simultaneously determine a free particle’s position and momentum to arbitrary precision. In quantum mechanics, the wave function of position is the Fourier transform of the wave function of momentum. The most popular use of Fourier uncertainty principles is as a description of the natural trade-off between the stability and measurability of a system, particularly quantum mechanical systems. Imagining that f(x) is the probability that a particle’s position is x, and $\hat{f}(\xi)$ is the probability that its momentum is $\xi$, Heisenberg’s inequality gives a lower bound on how spread out these two probability distributions must be. The physical assumption is that position and momentum are related by the Fourier transform: $\|f\|_{L^2}^2 \le 4\pi \, \|(x-x_0)f\|_{L^2} \cdot \|(\xi-\xi_0)\hat{f}\|_{L^2}$. Qualitatively, this means a narrow function has a wide Fourier transform, and a wide function has a narrow Fourier transform. In either domain, a wider function means there is literally a wide distribution of data, so there always exists uncertainty in one domain.

Fourier Transform Versus Fourier Series

We should not confuse the Fourier transform with the Fourier sine and cosine series. The function sin x does not have a Fourier transform, but its Fourier sine series is itself.

Laplace Transform

The Laplace transform allows us to transform a wider class of functions than the Fourier transform, because its kernel e -st decays to zero exponentially fast (there is no complex valued i in the exponent to mess things up). The Laplace transform operates on functions defined on [ 0 , ) , so in PDEs, we use it to transform the time variable, or other variables if they range from [ 0 , ) . So instead of solving the PDE directly in the time domain, we Laplace transform it, solve it in s domain, then inverse Laplace transform it back to the time domain.

The formula for the Laplace transform is:

$$L.T.(f(t)) = \hat{f}(s) = \int_0^{\infty} e^{-st} f(t) \, dt$$

And the formula for the inverse Laplace transform is:

$$L.T.^{-1}(\hat{f}(s)) = f(t) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} e^{st} \hat{f}(s) \, ds$$

Just like in the case of the Fourier transform, there are tables with the Laplace transforms of many functions computed for our convenience.

We care about how the Laplace transform acts on the (time) derivatives involved in a PDE. It is better to get rid of them, because how else will a PDE get us closer to its solution? It does:

  • L . T . ( u t ( x , t ) ) = s u ^ ( x , s ) - u ( x , 0 )

  • L . T ( u tt ( x , t ) ) = s 2 u ^ ( x , s ) - s u ( x , 0 ) - u t ( x , 0 )

Note that we usually know the initial conditions for a PDE u ( x , 0 ) and u t ( x , 0 ) , so the transformations do get rid of the derivatives with respect to time.

We also care about the convolution to multiplication property so that we can transfer back algebraic expressions in s to PDE solutions in t space using the inverse Laplace transform. Careful here that this is a finite convolution, from 0 to t, as opposed to from $-\infty$ to $\infty$ like in the Fourier transform case:

L . T . ( ( f * g ) ( t ) ) = f ^ ( s ) g ^ ( s ) where ( f * g ) ( t ) = 0 t f ( τ ) g ( t - τ ) d τ = 0 t f ( t - τ ) g ( τ ) d τ

Similar to the Fourier transform, the Laplace transform is a linear operator, so it is best used with linear PDEs.
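A quick numerical sanity check of the first derivative rule, using $f(t) = e^{-t}$ (our choice; its Laplace transform $\hat{f}(s) = 1/(s+1)$ is standard), where both sides of $L.T.(f'(t)) = s\hat{f}(s) - f(0)$ should equal $-1/(s+1)$:

```python
import numpy as np
from scipy.integrate import quad

# Check L.T.(f'(t)) = s * fhat(s) - f(0) for f(t) = exp(-t), whose Laplace
# transform is fhat(s) = 1/(s + 1); both sides should equal -1/(s + 1).
s = 2.0

lhs, _ = quad(lambda t: np.exp(-s * t) * (-np.exp(-t)), 0, np.inf)  # L.T. of f'
rhs = s * (1.0 / (s + 1.0)) - 1.0                                   # s*fhat(s) - f(0)

print(lhs, rhs)  # both are -1/3
```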

Reducing PDEs to Algebraic Equations or to ODEs

The Fourier transform, the Laplace transform, and some other transforms, such as the Hankel and Mellin transforms, are able to get rid of derivatives of the PDE in certain variables (time, space, etc.). We are left with an algebraic equation if the transform acts on all the variables involved in the PDE, or an ordinary differential equation (ODE) if the transform acts on all the variables except one. The hope here is that the algebraic equations or the ODE are easier to solve than the original PDE, and that we can utilize known methods from algebra, numerics, and ODEs to solve the new equations in transformation variables. We will see a simple example of how this works in the next section on solution operators.

Solution Operators

We now work through two simple but informative examples that illustrate the transform methods, while at the same time showcasing the ideas behind solution operators for PDEs. These are fairly general, and more importantly lay the groundwork for leveraging neural networks to solve PDEs. In addition, both examples have explicit analytic solutions, so we can use them to test approximation or iterative methods for solving PDEs (including neural network methods).

The first example uses a one-dimensional heat equation with constant coefficients on an infinite domain (this is time dependent), and the second example uses a two-dimensional Poisson equation with constant coefficients on a bounded domain with a simple geometry (this is not time dependent; the solution is static in time).

Example Using the Heat Equation

The heat equation on an infinite one-dimensional rod looks like:

$$u_t(x,t) = \alpha u_{xx}(x,t) \quad \text{for } x \in \mathbb{R}, \; t \in (0,\infty)$$
$$u(x,0) = u_0(x) \quad \text{for } x \in \mathbb{R}$$

Since this PDE is defined on an infinite domain in x, we must specify the far field conditions (there is no boundary, so we must specify what we think the solution function u(x,t) looks like when $x \to \infty$ and $x \to -\infty$). Let’s assume that these limits are zero.

For simplicity, let’s assume that the parameter α is constant so we can apply the Fourier transform. When applying this transform (with respect to x) to the PDE u t ( x , t ) = α u xx ( x , t ) and the initial condition, we manage to get rid of the derivatives in x and simplify the PDE into an ordinary differential equation with only one derivative in time:

$$\hat{u}_t(\xi,t) = -\alpha \xi^2 \hat{u}(\xi,t) \quad \text{for } \xi \in \mathbb{R}, \; t \in (0,\infty)$$
$$\hat{u}(\xi,0) = \hat{u}_0(\xi) \quad \text{for } \xi \in \mathbb{R}$$

We can now easily solve this using a method from ordinary differential equations called separation of variables (let’s not bother with the details), obtaining the solution in Fourier space:

$$\hat{u}(\xi,t) = e^{-\alpha \xi^2 t} \, \hat{u}_0(\xi)$$

We need the solution in x space, not in Fourier space, so we take the inverse Fourier transform of the above expression, and use the knowledge that multiplications become convolutions when transforming between x space and Fourier space. Therefore, the solution in x space is:

$$u(x,t) = F.T.^{-1}\left(e^{-\alpha \xi^2 t} \hat{u}_0(\xi)\right) = F.T.^{-1}\left(e^{-\alpha \xi^2 t}\right) * u_0(x) = \frac{1}{\sqrt{4\pi\alpha t}} e^{-x^2/4\alpha t} * u_0(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{4\pi\alpha t}} e^{-(s-x)^2/4\alpha t} \, u_0(s) \, ds = \int_{-\infty}^{\infty} kernel(s,x;t;\alpha) \, u_0(s) \, ds$$

The punch line from the calculation is this:

The solution u(x,t) of the PDE is the integral of some kernel function k ( s , x ; t ; α ) against the initial state of the solution u 0 ( s ) .

Moreover, the solution operator of the PDE maps the given input data, which in this case are the parameter α and initial state u 0 ( x ) , to the output, which is the solution we are seeking, u(x,t), by integrating the initial state against some kernel function that depends on the parameter of the PDE (along with its dependence on space and time).

Knowing the formula of this kernel, or using a neural network to approximate it, unlocks the solution of a given PDE. We then say that the neural network learned the solution operator of the PDE.

In our simple example, we leveraged linearity and constant coefficients to incorporate Fourier transform methods and convolution when reverting back to real space, and had the luxury of working out an explicit analytic formula of the kernel of the integral, namely, $k(s,x;t;\alpha) = \frac{1}{\sqrt{4\pi\alpha t}} e^{-(s-x)^2/4\alpha t}$, so there is no need for approximations. On a nice side note, this kernel comes from a time-dependent Gaussian function $Gaussian(x;t;\alpha) = \frac{1}{\sqrt{4\pi\alpha t}} e^{-x^2/4\alpha t}$ that spreads out as time evolves. Convolving this with the initial state of the solution has a smoothing effect that spreads out and smooths any initial oscillations and spikes. We observe this smoothing with any diffusion process that we can visualize, such as diffusion of smoke in the air, or diffusion of a dye in a liquid, where the substance spreads out smoothly until we obtain one homogeneous-looking medium.
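The whole pipeline (transform the initial state, multiply by $e^{-\alpha\xi^2 t}$, transform back) can also be carried out numerically with the FFT. A sketch on a periodic box wide enough that the solution never feels the boundary (an assumption of this discretization), compared against the exact Gaussian-kernel solution:

```python
import numpy as np

alpha, t, sigma = 0.5, 1.0, 1.0
n, L = 1024, 40.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
xi = np.fft.fftfreq(n, d=L / n) * 2 * np.pi

# Transform the initial state, damp each mode by exp(-alpha*xi^2*t), transform back:
u0 = np.exp(-x**2 / (2 * sigma**2))
u_fft = np.fft.ifft(np.exp(-alpha * xi**2 * t) * np.fft.fft(u0)).real

# Exact solution: convolving a Gaussian initial state with the heat kernel
# gives a wider, shorter Gaussian with variance sigma^2 + 2*alpha*t.
var = sigma**2 + 2 * alpha * t
u_exact = sigma / np.sqrt(var) * np.exp(-x**2 / (2 * var))

print(np.max(np.abs(u_fft - u_exact)))  # tiny: the FFT realizes the solution operator
```

The damping factor $e^{-\alpha\xi^2 t}$ applied mode by mode is precisely the solution operator in Fourier space, and the widening Gaussian shows the smoothing described above.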

Example Using the Poisson Equation

A Poisson equation on a bounded domain looks like:

$$-\nabla \cdot \left( a(\vec{x}) \nabla u(\vec{x}) \right) = f(\vec{x}) \quad \text{for } \vec{x} \in D, \qquad u(\vec{x}) = 0 \quad \text{for } \vec{x} \in \partial D$$

When a ( x ) is constant and the domain is two-dimensional, this becomes:

$$-a \Delta u(x,y) = f(x,y) \quad \text{for } (x,y) \in D \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial D$$

where Δ u ( x , y ) = u xx ( x , y ) + u yy ( x , y ) . We can employ the Fourier transform in x and y like we did for the heat equation (linear equation with constant coefficients), but let’s demonstrate the Green’s function method instead. We can think of the righthand side of the PDE as an aggregation in the continuum of impulses of intensity f(x,y) at locations (x,y). We need the Dirac delta measure δ (x,y) ( s , p ) to express the concept of an impulse mathematically. This is zero everywhere on the domain except at the point (x,y), where it is infinite, and its total measure on the domain is normalized to 1. The rationale here is that we can solve the PDE with the righthand side consisting of only an impulse at a certain location, then aggregate the original solution from these. Presumably, solving the PDE with only an impulse as the righthand side is easier than solving it with some given function as the righthand side, so we will build up the solution u(x,y) from the solution G(x,y;s,p) of the impulse PDE. More importantly, using the Green’s function allows us to get an integral representation of the solution of the input data against a kernel (which is the Green’s function). The PDE with an impulse righthand side is:

$$-a \Delta G(s,p;x,y) = \delta_{(x,y)}(s,p) \quad \text{for } (s,p) \in D \subset \mathbb{R}^2, \qquad G(s,p;x,y) = 0 \quad \text{for } (s,p) \in \partial D$$

Let’s now write:

$$f(x,y) = \int_D f(s,p) \, \delta_{(x,y)}(s,p) \, ds \, dp$$

and the PDE as:

$$-a \Delta u(x,y) = \int_D f(s,p) \, \delta_{(x,y)}(s,p) \, ds \, dp \quad \text{for } (x,y) \in D \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial D$$

Let’s substitute the δ (x,y) ( s , p ) inside the integral with - a Δ G ( s , p ; x , y ) :

$$-a \Delta u(x,y) = \int_D \left( -a \Delta G(s,p;x,y) \right) f(s,p) \, ds \, dp \quad \text{for } (x,y) \in D \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial D$$

Now let’s assume that we have the right conditions to swap differentiation and integration:

$$-a \Delta u(x,y) = -a \Delta \int_D G(s,p;x,y) \, f(s,p) \, ds \, dp \quad \text{for } (x,y) \in D \subset \mathbb{R}^2, \qquad u(x,y) = 0 \quad \text{for } (x,y) \in \partial D$$

Finally, this allows us to represent the solution u(x,y) as:

$$u(x,y) = \int_D G(x,y;s,p;a) \, f(s,p) \, ds \, dp$$

Note that we made the dependency of G on a explicit in G. Later in this chapter, when we want to learn the solution operator of the PDE in a neural network setting, the physical parameter a would be part of the network’s input. If the parameter a = a(s,p) is not constant, then we would write G(x,y;s,p;a(s,p)). Analogous to the discussion of the previous example, the punch line from the previous calculation is:

The solution u(x,y) of the PDE is the integral of some kernel function, in this case the Green’s function G ( s , p ; x , y ; a ) against the righthand side of the PDE f ( s , p ) .

Moreover, the solution operator of the PDE maps the given input data, which in this case is the parameter a and the righthand side of the PDE f ( x , y ) , to the output, which is the solution we are seeking, u(x,y), by integrating the righthand side function against some kernel function that depends on the parameter of the PDE (along with its dependence on space). In our case, the kernel of the solution operator is the Green’s function of the PDE, which we happen to know for the Poisson equation on domains with an easy geometry, but not in more complex situations. Once again, knowing the formula of this kernel, or using a neural network to approximate it, unlocks the solution of a given PDE.
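A one-dimensional sketch of this punch line, where the Green's function is known in closed form: for $-a u''(x) = f(x)$ on (0,1) with u(0) = u(1) = 0, the kernel is $G(x,s;a) = x(1-s)/a$ for $x \le s$ and $s(1-x)/a$ otherwise, and integrating it against the righthand side produces the solution. (The test case f = 1, with known solution $x(1-x)/(2a)$, is our choice.)

```python
import numpy as np

a = 1.0  # the constant coefficient of the PDE

def G(x, s):
    # Green's function of -a*u'' on (0, 1) with u(0) = u(1) = 0
    return (x * (1 - s) if x <= s else s * (1 - x)) / a

def trapezoid(y, x):
    # simple trapezoid rule, to stay independent of NumPy version details
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def solve(f, x, n=20_001):
    # the solution is the integral of the kernel against the righthand side
    s = np.linspace(0.0, 1.0, n)
    kernel = np.array([G(x, si) for si in s])
    return trapezoid(kernel * f(s), s)

# Test case f = 1, whose known solution is u(x) = x*(1 - x)/(2a):
x = 0.3
print(solve(lambda s: np.ones_like(s), x), x * (1 - x) / (2 * a))
```

Swapping in any other righthand side f reuses the same kernel, which is exactly what it means for G to encode the solution operator.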

Fixed Point Iteration

The fixed point iteration is useful to construct explicit solutions and prove existence and uniqueness for certain lucky PDEs. It is such an easy and general method, so a definite must-have in our toolbox. We will write it down, then immediately apply it to represent the solution of a dynamical system as a series. A dynamical system is an ordinary differential equation that describes the evolution in time of a particle or a bunch of particles (a system). Again, we would like neural networks to learn the solution operators of dynamical systems, so this is consistent with our previous discussion. Moreover, it is good to have the fixed point iteration series representation of the solution side by side with the neural network representation. Recall that in many mathematical settings, we can represent the same solution in multiple ways. A fixed point iteration series is additive, while a neural network representation is compositional. Moreover, neural networks seem to have the advantage of representing the solution operators of whole families of PDEs and an overall wider variety, which is like a dream come true in this field.

The fixed point iteration aims to find a fixed point of a function, or a point x * that the function maps back to itself: f ( x * ) = x * . This is not a simple task since f is usually nonlinear, and most of the time we have no idea if such points exist for a given function. As is always the case with nonlinear equations, iterative methods avoid one-shot solutions and instead come up with a sequence of points that hopefully and under the right conditions converge to the desired solution, in this case, the fixed point of a function.

How does it work?

Here is how the fixed point iteration goes:

  1. x 0 is the starting point

  2. x i+1 = f ( x i )

That’s it. Our sequence { x 0 , x 1 , x 2 , . . . } is generated by consecutive applications of f, and looks like { x 0 , f ( x 0 ) , f ( f ( x 0 ) ) , f ( f ( f ( x 0 ) ) ) , . . . } . Under the right conditions on f and x 0 , this sequence converges to a fixed point x * of f (so f ( x * ) = x * ).

Note that depending on f and the starting point, the asymptotic behavior of this sequence can be any of these:

Convergence to a limit x *

If the fixed point iteration converges for a continuous function f, then the limit must be a fixed point of f, so the iteration captures a fixed point.

Divergence to $\infty$

The sequence grows without a bound.

Periodic behavior

The sequence oscillates between two or more values.

Chaotic behavior

The sequence behaves erratically, with no pattern whatsoever.

Theorems related to the fixed point iteration assert that the choice of x 0 , the starting point for the fixed point iteration, matters for whether it converges to a fixed point or not.
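A minimal numerical sketch (the function f(x) = cos(x) and the starting point are our choices for illustration): the iterates converge to the unique solution of cos(x) = x.

```python
import math

# Fixed point iteration for f(x) = cos(x), starting from x0 = 1.0 (our choice).
# The sequence x_{i+1} = cos(x_i) converges to the unique solution of cos(x) = x.
x = 1.0
for _ in range(100):
    x = math.cos(x)

print(x, math.cos(x))  # the two agree: x is (numerically) a fixed point
```

Trying other starting points shows the convergent case; functions like f(x) = 2x or f(x) = 1 - x from suitable starting points exhibit the divergent and periodic behaviors listed above.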

How do we use it to solve ODEs and PDEs?

Our concern in this chapter is finding solutions of differential equations, which are functions. So we will first reformulate a PDE so that its solution u satisfies an equation that looks like F(u) = u (note that F here is an operator not a function), making it perfect for a fixed point iteration setting. Then we will apply the same logic as described previously and construct a sequence of functions that hopefully, under the right conditions, converge to the fixed point u * of the operator, which is the solution of the PDE that we are looking for. Note that in the previous discussion, we constructed a sequence of numbers (rather than functions) that hopefully, under the right conditions, converge to a fixed point of a function (rather than an operator).

Let’s demonstrate this using a dynamical system setting. This is one of the most important, general, and well-studied ordinary differential equations, which describes the time evolution of a point in space. It is easy in the sense that it is usually first order with only one derivative to get rid of, and hard in the sense that it is generally nonlinear. There is a habit of linearizing the dynamical system near a point and studying its linearized behavior. This is, in many cases, informative about the nonlinear behavior, but the two should not be conflated. Too much is known about linearized systems and not much is known about nonlinear systems, so we need equal attention to nonlinear systems. In this section we do not linearize. Instead, we approximate the solution using a series constructed by a fixed point iteration.

Knowing the initial state of the point u ( t 0 ) = u 0 , a solution trajectory u ( t ) tracks all of its future states. The function f ( u ( t ) , a ( t ) , t ) specifies the evolution:

$$\frac{du(t)}{dt} = f(u(t), a(t), t), \qquad u(t_0) = u_0$$

There is one derivative with respect to time that we need to get rid of, so we integrate once with respect to time:

$$u(t) = u_0 + \int_{t_0}^{t} f(u(s), a(s), s) \, ds$$

Now let’s rewrite this integral equation in a form that is ripe for the fixed point iteration. We consider the whole righthand side as an operator whose input is u ( t ) :

$$u(t) = F(u(t))$$

Now we can generate the sequence { u 0 ( t ) , u 1 ( t ) , u 2 ( t ) , u 3 ( t ) , . . . } that converges to the solution u ( t ) , for all time or for a finite amount of time, under the right conditions on f. The sequence looks like:

  • u 0 ( t ) = u 0 ( t )

  • u 1 ( t ) = F ( u 0 ( t ) ) = u 0 + t 0 t f ( u 0 ( s ) , a ( s ) , s ) d s

  • u 2 ( t ) = F ( u 1 ( t ) ) = u 0 + t 0 t f ( u 1 ( s ) , a ( s ) , s ) d s

  • u 3 ( t ) = F ( u 2 ( t ) ) = u 0 + t 0 t f ( u 2 ( s ) , a ( s ) , s ) d s

and so on.

Simple but very informative example

The best examples are the ones that are simple enough to have multiple ways of solving them. Seeing more than one way to do the same thing at once helps solidify the gist of a newly learned method. Consider the very simple one-dimensional and linear dynamical system:

$$\frac{du(t)}{dt} = u(t), \qquad u(0) = 1$$

The first way to solve this is by separation of variables, where we put everything that has u(t) on one side of the equation, and everything that has t alone on the other side:

$$\frac{du(t)}{u(t)} = dt$$

Now we can integrate from 0 to t:

$$\int_0^t \frac{du(s)}{u(s)} = \int_0^t ds$$

We get ln ( u ( t ) ) = t , therefore the solution of our simple dynamical system using the separation of variables method is u ( t ) = e t (arguably the most important function in mathematics). Now let’s construct a sequence of functions using the fixed point iteration and see if it converges to the solution u ( t ) = e t of the dynamical system:

  • u 0 ( t ) = 1

  • u 1 ( t ) = F ( u 0 ( t ) ) = u 0 ( t ) + 0 t u 0 ( s ) d s = 1 + 0 t 1 d s = 1 + t

  • u 2 ( t ) = F ( u 1 ( t ) ) = u 0 ( t ) + 0 t u 1 ( s ) d s = 1 + 0 t 1 + s d s = 1 + t + t 2 2

  • u 3 ( t ) = F ( u 2 ( t ) ) = u 0 ( t ) + 0 t u 2 ( s ) d s = 1 + 0 t 1 + s + s 2 2 d s = 1 + t + t 2 2 + t 3 3!

  • Keep going:

    u n ( t ) = F ( u n-1 ( t ) ) = u 0 ( t ) + 0 t u n-1 ( s ) d s = 1 + t + t 2 2 + t 3 3! + + t n n!

As $n \to \infty$, the fixed point iteration converges to the series:

$$u_{\infty}(t) = 1 + t + \frac{t^2}{2} + \frac{t^3}{3!} + \cdots + \frac{t^n}{n!} + \cdots = \sum_{n=0}^{\infty} \frac{t^n}{n!}$$

which is the power series expansion of u ( t ) = e t , the same solution we arrived at using separation of variables (albeit in different form). Cool.
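The hand computation above can be reproduced symbolically; a sketch with SymPy, where u is rebuilt each step as $1 + \int_0^t u(s)\,ds$, mirroring the operator F:

```python
import sympy as sp

t, s = sp.symbols("t s")

# Picard's iteration for du/dt = u, u(0) = 1, in the form u = F(u):
#     u_n(t) = 1 + integral from 0 to t of u_{n-1}(s) ds
u = sp.Integer(1)  # u_0(t) = 1
for _ in range(5):
    u = 1 + sp.integrate(u.subs(t, s), (s, 0, t))

print(sp.expand(u))  # the first six terms of the series for e**t
```

Each pass through the loop adds the next term of the exponential series, exactly as in the iterates written out above.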

When we use this iterative way to construct a solution of a dynamical system, or of a PDE reformulated as a dynamical system or in a way that is fit for a fixed point iteration (u=F(u)), we call it Picard’s iteration. It is simple and arrives at the solution (when it converges) in steps.

Where is the complication?

Why don’t we use Picard’s iteration to construct solutions of all dynamical systems and of all PDEs that we are able to reformulate into a form fit for fixed point iteration? As always, the answer is the curse of dimensionality. Even for our very simple one-dimensional and linear example, each Picard iteration step involves evaluating an integral, which for more complex problems we have to evaluate numerically. For example, for dynamical systems representing the evolution and interactions of many particles, this gets multiplied by the number of particles. Overall in the ODE and PDE literature, there is a limited number of cases where practical algorithms are available for high-dimensional settings.
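A quick back-of-the-envelope computation shows why the cost blows up: a tensor-product quadrature grid with N points per axis needs N^d points in d dimensions (the choice N = 100 here is arbitrary):

```python
# The curse of dimensionality for naive numerical integration:
# a uniform grid with N points per axis needs N**d points in d dimensions,
# so each Picard step's integral becomes intractable for large d.
N = 100  # points per dimension

for d in (1, 3, 10, 100):
    print(d, N ** d)  # grid sizes grow as 1e2, 1e6, 1e20, 1e200
```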

Recent successes!

That said, a recent method for finding explicit solutions for high-dimensional nonlinear parabolic PDEs and backward stochastic differential equations based on Picard’s iteration has been quite successful in finding explicit solutions for high-dimensional PDEs arising in real-life physics and finance applications. The paper’s (Weinan et al. 2017) abstract is insightful:

Parabolic partial differential equations (PDEs) and backward stochastic differential equations (BSDEs) are key ingredients in a number of models in physics and financial engineering. In particular, parabolic PDEs and BSDEs are fundamental tools in the state-of-the-art pricing and hedging of financial derivatives. The PDEs and BSDEs appearing in such applications are often high-dimensional and nonlinear. Since explicit solutions of such PDEs and BSDEs are typically not available, it is a very active topic of research to solve such PDEs and BSDEs approximately. In the recent article [E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. Linear scaling algorithms for solving high-dimensional nonlinear parabolic differential equations. arXiv:1607.03295 (2017)] we proposed a family of approximation methods based on Picard approximations and multilevel Monte Carlo methods and showed under suitable regularity assumptions on the exact solution for semilinear heat equations that the computational complexity is bounded by O(d ε^{-(4+δ)}) for any δ ∈ (0, ∞), where d is the dimensionality of the problem and ε ∈ (0, ∞) is the prescribed accuracy. In this paper, we test the applicability of this algorithm on a variety of 100-dimensional nonlinear PDEs that arise in physics and finance by means of numerical simulations presenting approximation accuracy against runtime. The simulation results for these 100-dimensional example PDEs are very satisfactory in terms of accuracy and speed. In addition, we also provide a review of other approximation methods for nonlinear PDEs and BSDEs from the literature.

Setting the stage for deep learning for PDEs

Before leaving this section, let’s set the stage for solving ODEs and PDEs in the context of deep learning, in particular, for deep operator networks. We’ll keep our one-dimensional dynamical system example, but this time we highlight the dependence on the physical parameters a(t), and make it slightly more general by adding another explicit dependence on time:

\frac{du(t)}{dt} = f(u(t), a(t), t), \qquad u(t_0) = u_0

As before, we integrate once with respect to time:

u(t) = u_0 + \int_{t_0}^{t} f(u(s), a(s), s)\,ds

The purpose of a neural network is to take data as input, do something to it, then give us an output that we care about. For an ODE or a PDE, of course the output that we care about is the solution u(t). Let’s write this solution as the output of some operator G that takes the given data of an ODE or PDE as input. In our dynamical system case, the input data is the function a(t) representing the physical parameters of the dynamical system. Note that we do not need to input the righthand side function f of the dynamical system. This is implicit in the training data, which now looks like the pairs (training input, training output) = (a(t), u(t)). By not inputting f, in a way we are saying that we don’t care about the exact form of the ODE or PDE that this behavior comes from, but we are able to learn what the system is doing, regardless. This is the epitome of machine learning: no need to encode the rules that the system obeys; the model can still emulate it if it observes enough instances of it.

We can write the solution u(t) = G(a(t)), where the solution operator G is to be learned using a neural network. Hold this notation and thought until later in this chapter when we discuss neural operator networks. Plugging u(t) = G(a(t)) into the integral equation, we find that the solution operator, which we want to learn using a neural network, satisfies:

G(a(t)) = u_0 + \int_{t_0}^{t} f(G(a(s)), a(s), s)\,ds

We just wrote an integral equation that we will not do anything with. It just shows the true property that the entity G(a(t)) that we care for satisfies. In the previous discussion, we approximated this using a Picard’s iteration, and in the new era of deep learning, we approximate it using a deep operator network (more on this soon). This deep learning approach is computationally more efficient if we include the Fourier transform to speed up the computations. Moreover, the deep learning approach is more widely encompassing, in the sense that it applies to more PDEs and ODEs than just dynamical systems. A dynamical system is easy to integrate once and obtain a representation that gets us closer to the solution, which is not the case for many ODEs and PDEs.
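To make the operator-learning setup concrete, here is a hedged sketch of generating training pairs (a(t), u(t)) for a hypothetical choice f(u, a, t) = a(t), so that G(a)(t) = u_0 + ∫_0^t a(s) ds; the random sinusoidal form of a(t) and the trapezoid solver are illustrative choices only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dynamical system with f(u, a, t) = a(t), so the solution
# operator is G(a)(t) = u0 + integral_0^t a(s) ds.
t_grid = np.linspace(0.0, 1.0, 100)
u0 = 0.0

def sample_a(rng):
    # a random smooth parameter function a(t)
    c = rng.normal(size=3)
    return lambda t: c[0] + c[1] * np.sin(2 * np.pi * t) + c[2] * np.cos(2 * np.pi * t)

def solve(a, t):
    # u(t) = u0 + integral_0^t a(s) ds via a cumulative trapezoid rule
    vals = a(t)
    dt = t[1] - t[0]
    return u0 + np.concatenate(([0.0], np.cumsum((vals[:-1] + vals[1:]) * dt / 2)))

# each training pair maps a sampled input function to a sampled solution
pairs = [(a(t_grid), solve(a, t_grid)) for a in (sample_a(rng) for _ in range(100))]
print(len(pairs))  # 100 (input, output) function pairs
```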

Mesh independence and different resolutions

One final note: an input and output pair for a neural network learning the solution of our dynamical system looks like: (training input, training output) = (a(t), u(t)). Since machines take numerical values only and not functions, we must discretize when we implement this. Here’s the pretty thing that differentiates operators acting on functions from functions acting on points and gives neural operator networks their mesh independence feature: a(t) and u(t) do not have to be discretized at the same values of t. All we care about is to map one function to another, so we can think of discretized a(t) as a vector mapped to another vector that is the discretized u(t), not necessarily at the same points or even of the same size. For the same reason, we can train the network at a given resolution, then make predictions at another resolution. This is great for the field of ODEs and PDEs, where the quality of numerical solutions has always been limited by the resolution of the employed discretization.
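The point about mismatched discretizations can be sketched directly; in this hedged example (assuming again the illustrative choice G(a)(t) = ∫_0^t a(s) ds), the input a(t) is sampled at 50 sensor points while the output u(t) is recorded at 200 different query points:

```python
import numpy as np

rng = np.random.default_rng(0)

# The sensor grid for the input a(t) and the query grid for the output u(t)
# deliberately differ -- the operator maps functions, not fixed-size vectors.
t_in = np.linspace(0.0, 1.0, 50)      # grid where a(t) is sampled
t_out = np.linspace(0.0, 1.0, 200)    # a different grid where u(t) is recorded
u0 = 0.0

def sample_a(rng):
    c = rng.normal(size=3)
    return lambda t: c[0] + c[1] * np.sin(2 * np.pi * t) + c[2] * np.cos(2 * np.pi * t)

def solve(a, t_query):
    # u(t) = u0 + integral_0^t a(s) ds on a fine grid, then read off t_query
    s = np.linspace(0.0, 1.0, 2001)
    vals = a(s)
    ds = s[1] - s[0]
    cum = np.concatenate(([0.0], np.cumsum((vals[:-1] + vals[1:]) * ds / 2)))
    return u0 + np.interp(t_query, s, cum)

pairs = [(a(t_in), solve(a, t_out)) for a in (sample_a(rng) for _ in range(100))]
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)  # 100 (50,) (200,)
```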

AI for PDEs

After surveying the main concerns and the basic approaches for solving PDEs, we are finally ready to discuss AI as it relates to PDEs, instead of only hinting at it or setting its stage here and there. We want to distinguish between a few different ways in which deep learning has stepped into the PDE community:

  • Deep learning to learn the PDE’s physical parameter values

  • Deep learning to learn two-dimensional and three-dimensional meshes for numerical simulations and solid modeling

  • Deep learning to learn a PDE’s solution operator: a neural network learns a map between two infinite dimensional spaces

  • Deep learning to bypass PDEs and simulate natural phenomena directly from observing data (particle systems and their interactions)

Deep Learning to Learn Physical Parameter Values

We can use neural networks to infer the parameters of a PDE model and their uncertainties. We get training data from experiments (real or synthetic via simulations of well-known phenomena with known parameters). This training data is labeled with the parameter values, so the neural network learns to map a certain PDE’s initial setting to appropriate parameter values, leading to more accurate modeling results. Historically, parameters that couldn’t be directly measured had to be guessed or hand tuned to fit some observed behavior, a practice that undermines the whole modeling process. This simple application of deep learning helps the PDE modeling community tremendously, because it brings more authenticity to their results. We can now learn parameter values from labeled images of experiments, recorded audio, and other unstructured or very high-dimensional data. Once trained, the neural networks can estimate parameters and uncertainties for any input data with similar settings. This poster has a nice and simple example that uses deep learning to predict the parameters of the velocity field for the G-equation (which models the combustion process) using flame front-image data: “Bayesian Inference in Physics-Based Nonlinear Flame Models”.
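As a toy stand-in for this idea (not the poster’s model), the sketch below generates decay trajectories u(t) = e^{-at} with known parameters, adds measurement noise, and fits a plain least-squares readout in place of a neural network to map an observed trajectory back to its parameter a; every numerical choice here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Trajectories u(t) = exp(-a*t) are generated with known decay parameters a,
# and a linear least-squares readout plays the role of the network that maps
# an observed trajectory to the physical parameter that generated it.
t = np.linspace(0.1, 2.0, 8)                        # observation times
a_train = rng.uniform(0.5, 3.0, size=500)           # labels: true parameters
X_train = np.exp(-np.outer(a_train, t))             # observed trajectories
X_train += 0.01 * rng.normal(size=X_train.shape)    # measurement noise

# fit the readout a ~ X @ w by least squares
w, *_ = np.linalg.lstsq(X_train, a_train, rcond=None)

# inference on a fresh, noise-free trajectory generated with a = 1.7
a_hat = np.exp(-1.7 * t) @ w
print(a_hat)  # close to 1.7
```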

Deep Learning to Learn Meshes

We learned in this chapter that generating a mesh is an integral part of the finite element method, which in turn finds numerical solutions for a wide array of PDEs modeling natural phenomena with complex domain geometries. The quality of the underlying mesh affects the quality of the numerical solution. The finer the mesh, the more of the true solution it is likely to capture, but the more computational cost it incurs. An ideal mesh would be dense where the error between the numerical solution and the true solution is more likely to be high, and coarse where the error is low, hence preserving fidelity while keeping the overall computational cost manageable (Figure 13-12).

Figure 13-12. A nonuniform mesh (left) and a uniform mesh (right); a finer mesh is needed where the error is larger (image source)

It would be nice if, given a PDE, its domain geometry, boundary conditions, and parameter values as inputs, we could train a neural network to automatically generate an ideal mesh, predicting the density distribution of the mesh elements at each location of the domain. This is exactly what MeshingNet does.

Before MeshingNet, mesh learning was done via expensive multistep finite element solutions and error estimators. In contrast, MeshingNet relies on similar problems to predict ideal meshes for new problems. It starts with an initial uniform and coarse mesh, and predicts a nonuniform mesh density for refinement. A hallmark of deep learning, MeshingNet generalizes well to different geometric domains with various governing PDEs, boundary conditions, and parameters.

The inputs to MeshingNet are the governing PDE, PDE parameters, domain geometry, and the boundary conditions, and the output is an area upper-bound distribution A(X) over the whole domain. The mapping between input and output is highly nonlinear and is thus learned by a neural network, which has demonstrated an impressive ability to express many kinds of nonlinear relationships.

To build the training data set, the MeshingNet team computes high-accuracy solutions on high-density uniform meshes using standard finite element solvers. They also do the same computation for low-density uniform meshes to obtain lower-accuracy solutions. Then the team computes an error distribution E(X) by interpolating between these solutions. They use E(X) as a guide to refine A(X). They enrich the training data by combining different geometries with different parameters and boundary conditions.
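A one-dimensional toy version of this data-generation step, assuming a simple finite-difference Poisson solve in place of the team’s finite element solver: compute a high-accuracy and a low-accuracy solution, interpolate between the two grids, and use the pointwise gap as the error distribution E(X):

```python
import numpy as np

def solve_poisson(n):
    # finite-difference solve of -u'' = f on (0, 1), u(0) = u(1) = 0
    x = np.linspace(0.0, 1.0, n + 1)
    h = 1.0 / n
    f = np.sin(np.pi * x)                    # example righthand side
    # tridiagonal system for the interior nodes
    A = (np.diag(2.0 * np.ones(n - 1))
         + np.diag(-np.ones(n - 2), 1)
         + np.diag(-np.ones(n - 2), -1))
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(A, h * h * f[1:-1])
    return x, u

# high- and low-accuracy solutions, as in the MeshingNet training data
x_fine, u_fine = solve_poisson(256)
x_coarse, u_coarse = solve_poisson(16)

# interpolate the coarse solution onto the fine grid; the pointwise gap
# plays the role of the error distribution E(X) that guides refinement
E = np.abs(u_fine - np.interp(x_fine, x_coarse, u_coarse))
print(E.max())  # small but nonzero: where it peaks, refine the mesh
```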

Deep learning for three-dimensional meshes

Three-dimensional meshes (Figure 13-13) are useful for computer graphics, animations for the entertainment industry, and solid modeling. They are also highly desirable for reconstructing textured and realistic surfaces from a given set of three-dimensional data points. Traditional methods include Delaunay triangulations and Voronoi diagrams, which interpolate points using triangular meshes. However, when there is noise in the coordinates, the resulting surface becomes unnecessarily rough, which calls for data preprocessing.

Figure 13-13. A three-dimensional mesh (image source)

Deep learning is stepping in to generate higher-quality three-dimensional meshes; see, for example, “Deep Hybrid Self-Prior for Full 3D Mesh Generation” (Wei et al. 2021) and “Pixel2Mesh” (Wang et al. 2018), which produces a three-dimensional shape in triangular mesh from a single color image by continuously deforming an ellipsoid.

Deep Learning to Approximate Solution Operators of PDEs

We have already started this discussion multiple times in this chapter. Instead of using deep learning to enhance existing methods for PDEs, such as learning physical parameter values from data, or learning better meshes for numerical methods, we would like to learn a PDE’s solution operator. This maps the PDE’s input, such as its domain, physical parameters, initial/final states of the solution, and/or boundary conditions, directly to its solution. We can think of this as:

  • Solution of PDE = function(PDE’s physical parameters, domain, boundary conditions, initial conditions, etc.)

We want to build a neural network to approximate this function. This is in fact an operator and not a function in the usual sense, since it sends functions to other functions. The caveat here is that differential operators and their inverses map infinite dimensional spaces to infinite dimensional spaces, sometimes in a linear way, such as the map from the righthand side of a Poisson equation to the solution, and most of the time in a nonlinear way, such as the map from the parameters of a Poisson equation to the solution. In contrast, the inputs and outputs of the neural networks that we learned about throughout this book are finite dimensional (inputs and outputs are vectors, images, graphs, etc.). These neural networks are able to approximate function mappings between finite dimensional spaces. They have a powerful universal approximation theorem going on for them, and a myriad of successes in practical applications (we can approximate any continuous function to arbitrary accuracy using neural networks if we place no constraints on the width and depth of the hidden layers). To solve PDEs analogously using deep learning, we must answer two questions:

Can neural networks approximate mappings between infinite dimensional spaces?

That is, can they approximate any nonlinear continuous functional (the input to the network would be a function or a bunch of functions, and the output would be a real number) or nonlinear operator (the input to the network would be a function or a bunch of functions, and the output would be another function)? The answer is a yes!

There is a universal approximation theorem for neural network operators just like there are universal approximation theorems for neural network functions. A neural network with a single hidden layer can approximate accurately any nonlinear continuous functional or operator. Moreover, neural networks are able to learn the solution operator of an entire family of PDEs, as opposed to classical methods for solving PDEs that only solve a single instance of a given PDE at a time.

How do we implement this in practice?

For the finite dimensional case, a node in a neural network linearly combines the finite dimensional features of the input vector (or the outputs of the previous layer), adds a bias term, applies a nonlinear activation function, then passes the result to the next layer. The analog for the infinite dimensional case, where we do not have finitely many entries to linearly combine anymore, would be to integrate some learnable kernel (multiplier function) of the input functions (for numerical integration, we have to sample this at finitely many points, converting integration to addition), add a bias function (this is optional), and apply a nonlinear activation function before passing the result to the next layer. The next layer would then add multiples of the node results of the previous layer, and integrate them against a learnable kernel of the results of the nodes of the previous layer, and so on. One example of doing this would look like:

u_{n+1}(x) = \sigma\Big( \int_D kernel(x, s, a(x), a(s); \omega)\, u_n(s)\, ds + W u_n(x) \Big),

where we arrive at the solution u(x) iteratively after a certain number of global integrations against a kernel, local linear transformations, and compositions with a nonlinear activation function. The parameters of the kernel in the iterative process are ω and the entries of W. The neural network learns these parameters from the labeled data (labeled with the solutions of the PDE) during training by minimizing a loss function. Analogous to the finite dimensional case, neural operator networks approximate nonlinear operators by composing linear integral operators that act globally over the entire domain with nonlinear activation functions. The previous iterative formula also includes a local linear multiplier, which becomes a matrix when we discretize.
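On a uniform grid, the integral in the iteration above becomes a weighted matrix-vector product. A minimal NumPy sketch of one such discretized layer follows; the grid, the kernel initialization, and the tanh activation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# One discretized neural-operator layer:
#   u_{n+1}(x) = sigma( integral_D kernel(x, s; omega) u_n(s) ds + W u_n(x) )
# On a uniform grid, the integral becomes a quadrature-weighted matrix product.
m = 64                                   # grid points discretizing D = [0, 1]
x = np.linspace(0.0, 1.0, m)
ds = x[1] - x[0]

K = rng.normal(scale=0.1, size=(m, m))   # learnable kernel values k(x_i, s_j)
W = rng.normal(scale=0.1)                # learnable local linear multiplier
sigma = np.tanh                          # nonlinear activation

def operator_layer(u):
    # rectangle-rule quadrature replaces the integral over D
    return sigma(K @ u * ds + W * u)

u = np.sin(np.pi * x)                    # some current iterate u_n
u_next = operator_layer(u)
print(u_next.shape)                      # (64,)
```

During training, gradient descent on a loss over labeled (input, solution) pairs would update the entries of K and W.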

Neural operator networks to learn the solution operators that we derived

Let’s pause for a moment and compare the previous expression to the three true solution operators that we derived for: the heat equation, Poisson equation, and dynamical systems. We can easily adapt the neural operator iterative process to each of these three settings:

  • For the heat equation in one space dimension and constant coefficients, the solution operator maps the initial state and the PDE’s physical parameter (constant) to the solution u(x,t). We are lucky enough to have explicit formulas for all the quantities involved:

    G(u_0(x), a) = u(x,t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{4\pi a t}}\, e^{-\frac{(s-x)^2}{4at}}\, u_0(s)\, ds = \int_{-\infty}^{\infty} kernel(s, x; t; a)\, u_0(s)\, ds

    In this case, the neural operator network does the following iteration to approximate the true operator:

    G(u_0(x), a) = u(x,t) \approx u_{n+1}(x,t) = \sigma\Big( \int_D kernel(s, x; t; a; \omega)\, u_n(s)\, ds + W u_n(x) \Big)
  • For a Poisson equation in two space dimensions, zero boundary conditions, and constant coefficients, the solution operator maps the PDE’s righthand side f, and its physical parameters (constant) to the solution u(x,y), and only for certain simple geometries. We are lucky to have explicit formulas for all the quantities involved (none of which we write here):

    G(f(x,y), a) = u(x,y) = \int_D GreenFunction(x, y; s, p; a)\, f(s, p)\, ds\, dp

    In this case, the neural operator network does the following iteration to approximate the true operator:

    G(f(x,y), a) = u(x,y) \approx u_{n+1}(x,y) = \sigma\Big( \int_D kernel(x, y, s, p, a; \omega)\, u_n(s, p)\, ds\, dp + W u_n(x, y) \Big)
  • For a one-dimensional dynamical system, the solution operator maps the ODE’s physical parameter (function) to the solution u(t), and we have an implicit integral equation that it satisfies:

    G(a(t)) = u(t) = u_0 + \int_{t_0}^{t} f(G(a(s)), a(s), s)\, ds.

    In this case, the neural operator network does the following iteration to approximate the true operator:

    G(a(t)) = u(t) \approx u_{n+1}(t) = \sigma\Big( \int_{t_0}^{t} kernel(t, s, a(t), a(s); \omega)\, u_n(s)\, ds + W u_n(t) \Big)

    Here, one data point is a triplet (t, a(t), G(a(t))), and thus one specific input a may appear in multiple data points with different values of t. For example, a data set of size 10,000 may only be generated from 100 a(t) trajectories, and each evaluates G(a)(t) for 100 t locations.
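The triplet bookkeeping in the last bullet can be sketched directly. Assuming, for illustration only, the hypothetical choice f(u, a, t) = a(t) so that G(a)(t) is just an integral of a, 100 trajectories evaluated at 100 values of t yield a data set of 10,000 points:

```python
import numpy as np

rng = np.random.default_rng(3)

# Build (t, a(t), G(a(t))) triplets: 100 trajectories, each evaluated at
# 100 values of t, give a data set of 10,000 points.  Hypothetical choice:
# f(u, a, t) = a(t), so G(a)(t) = u0 + integral_0^t a(s) ds.
n_traj, n_t = 100, 100
t_grid = np.linspace(0.0, 1.0, n_t)
dt = t_grid[1] - t_grid[0]
u0 = 0.0

triplets = []
for _ in range(n_traj):
    c = rng.normal(size=2)
    a_vals = c[0] + c[1] * t_grid                  # a(t) sampled on the grid
    # cumulative trapezoid gives G(a)(t) on the same grid
    G = u0 + np.concatenate(([0.0], np.cumsum((a_vals[:-1] + a_vals[1:]) * dt / 2)))
    for t, a_t, g_t in zip(t_grid, a_vals, G):
        triplets.append((t, a_t, g_t))

print(len(triplets))  # 10000
```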

One thing to observe with these three different settings is that neural operator networks require only input and output data, and no knowledge of the underlying PDEs. The knowledge about the PDEs is implicit in the training data. With this in mind, let’s reinforce the input output form of a neural operator network:

  • Solution of PDE ≈ learned operator(PDE’s physical parameters, domain, boundary conditions, initial conditions, etc.)

The important questions

When we branch out beyond this book to expand our knowledge on neural operator networks, we must keep the following questions as our guide:

For a given PDE, what is the input to the network, and what is the output?

We addressed these in the simple contexts of the heat equation, Poisson equation, and dynamical systems.

What is an example of an architecture of a neural operator?

Figure 13-14 shows the input and output structure of DeepONet. The input is a discretized pair (t,a(t)), and the output is a discrete G(a(t)).

Figure 13-14. (A) The network learning the operator G(a(t)) takes two inputs, (a(t_1), a(t_2), …, a(t_m)) and t. (B) Illustration of the training data (image source).
How do we deal with the fact that the inputs have such drastic differences in dimension, such as finite dimensional and infinite dimensional at the same time?

In other words, how do we discretize the involved finite dimensional (independent variables such as time and space) and infinite dimensional quantities (solution functions, parameter function, boundary conditions, initial conditions, etc.) during training and inference? Note that for neural networks that learn finite dimensional mappings, inputs (tables, images, audio files, graphs, natural language text) always have the same dimension, are preprocessed to have the same dimension, or the network itself processes fixed dimension portions of the input individually.

How do we avoid the common trap in many PDE solution methods, which end up being discretization dependent?

In what sense are neural operator networks meshless and able to generalize their learned parameters to work for other discretizations than the ones they have been trained on? The significant advancement here is that neural operator networks are discretization invariant, sharing the same network parameters between different discretizations. This means that their outputs do not depend on the underlying discretization and can be used with different grid representations.

How do we speed up the computation time for the integrals involved in the neural operators and make it less costly?

We involve the Fourier transform. A Fourier neural network speeds up the computation of the involved integrals by transforming the inputs to Fourier space, where fast Fourier transform methods are at its disposal. We discuss this network in the next subsection.

How do neural operator networks fare with high-dimensional PDEs involving hundreds or thousands of variables?

PDEs that model financial markets with all the underlying assets (Black-Scholes), game theoretic settings with many participating agents (Hamilton-Jacobi-Bellman), or physical systems with many particles are very high-dimensional. Discretization in each of these dimensions explodes the size of an already big problem computationally and has until now made any practical implementation of these elegant PDEs infeasible. The article “Solving High-Dimensional Differential Equations Using Deep Learning” (Han et al. 2018), which we discuss soon, uses AI techniques to address such PDEs, but it would be nice to compare that article’s methodology to a deep neural operator setting.

Fourier neural network

The California Institute of Technology has recently open sourced its Fourier neural network for solving partial differential equations; its approach is shown in the article “Fourier Neural Operator for Parametric Partial Differential Equations” (Li et al. 2021). These networks can approximate solution operators for PDEs that are highly nonlinear, with high frequency modes and slow energy decay.

Each layer in a Fourier neural network applies a fast Fourier transform to its input data, then a linear transform, then an inverse fast Fourier transform. This results in a quasi-linear computational complexity, that is, of order O(n polynomial(log(n))) and makes the model invariant to the spatial resolution of the data (even though it still requires a uniform mesh).
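One such layer can be sketched in a few lines of NumPy; the number of retained modes, the scalar W, and the tanh activation are illustrative choices, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

# One Fourier layer, sketched: FFT the input, apply a learnable linear
# transform R to the lowest modes and zero out the rest, inverse FFT,
# and add a local linear term W*v before the activation.
m = 128                                   # uniform grid size
k_max = 16                                # number of retained Fourier modes
R = rng.normal(scale=0.1, size=k_max) + 1j * rng.normal(scale=0.1, size=k_max)
W = 0.1                                   # local linear transform
sigma = np.tanh

def fourier_layer(v):
    v_hat = np.fft.rfft(v)                # forward FFT
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max] = R * v_hat[:k_max]   # act on low modes, filter the rest
    return sigma(np.fft.irfft(out_hat, n=len(v)) + W * v)

x = np.linspace(0.0, 1.0, m, endpoint=False)
v = np.sin(2 * np.pi * x)
out = fourier_layer(v)
print(out.shape)                          # (128,)
```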

Figure 13-15 shows the architecture of the Fourier neural network.

Figure 13-15. The architecture of the Fourier neural network (image source)

The input is the physical parameter a(x), and the output is the PDE solution u(x):

  1. Start from input a(x).

  2. Lift to a higher-dimensional channel space by a shallow, fully connected neural network P: v0(x) = P(a(x)).

  3. Apply several Fourier layers of integral operators and activation functions. In each of these layers, we apply the Fourier transform F, apply a linear transform R on the lower Fourier modes while filtering out the higher modes, then apply the inverse Fourier transform F⁻¹. On the bottom path, apply a local linear transform W.

  4. Project back to the target dimension by a neural network Q, finishing with the output u(x) = Q(vT(x)): the projection of vT by the local transformation Q, also parameterized by a shallow, fully connected neural network.
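The data flow above can be sketched in a few lines of NumPy. This is a minimal, untrained 1D sketch with random weights, only to make the shapes and the lift, Fourier layers, and projection pipeline concrete; the grid size, channel width, mode count, and ReLU activation are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def fourier_layer(v, R, W, n_modes):
    """One Fourier layer: FFT -> linear transform R on the lowest
    n_modes Fourier modes (higher modes filtered out) -> inverse FFT,
    plus a local linear transform W on the bottom path, then ReLU.
    v has shape (n_grid, channels)."""
    v_hat = np.fft.rfft(v, axis=0)                 # FFT along the grid
    out_hat = np.zeros_like(v_hat)
    out_hat[:n_modes] = np.einsum("kio,ki->ko", R, v_hat[:n_modes])
    spectral = np.fft.irfft(out_hat, n=v.shape[0], axis=0)
    return np.maximum(spectral + v @ W, 0.0)

def fno_forward(a, P, layers, Q, n_modes):
    """a: input parameter a(x) sampled on a uniform grid of n_grid points."""
    v = a[:, None] @ P                             # lift: v0(x) = P(a(x))
    for R, W in layers:
        v = fourier_layer(v, R, W, n_modes)
    return (v @ Q).squeeze()                       # project: u(x) = Q(vT(x))

# Untrained random weights, only to show the shapes and the data flow
rng = np.random.default_rng(0)
n_grid, width, n_modes = 64, 8, 12
P = rng.normal(size=(1, width))
layers = [
    (0.1 * (rng.normal(size=(n_modes, width, width))
            + 1j * rng.normal(size=(n_modes, width, width))),
     0.1 * rng.normal(size=(width, width)))
    for _ in range(4)
]
Q = rng.normal(size=(width, 1))
a = np.sin(2 * np.pi * np.linspace(0, 1, n_grid))
u = fno_forward(a, P, layers, Q, n_modes)          # u has shape (64,)
```

Because all the spectral work happens in the FFT, evaluating the same weights on a finer uniform grid requires no retraining, which is the source of the mesh invariance discussed next.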

The article demonstrates the method with a variety of important PDEs:

  • Burgers’ equation

  • Darcy flow

  • Navier-Stokes equation

  • Turbulent flows in regimes where other methods diverged

The Fourier neural network is mesh invariant, so it can be trained on a lower resolution and evaluated at a higher resolution, without seeing any higher-resolution data (zero-shot super-resolution).

Since data-driven methods rely on the quality and quantity of data, we need to generate training pairs of inputs and outputs for the neural operator networks by solving the actual PDEs using some other method. To this end, the authors note that to learn the Navier-Stokes equations with viscosity = 1e-4, we need to generate N = 10,000 training pairs (a(x), u(x)) using a numerical solver. For more challenging PDEs, generating even a few training samples can be very expensive. A future direction would be to combine neural operators with numerical solvers to lower the data requirements.
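As a toy illustration of generating such training pairs with a numerical solver, the sketch below solves a one-dimensional Darcy-type problem −(a(x)u′(x))′ = f with a simple finite-difference scheme, for randomly sampled coefficient fields a(x). The random-field construction, grid size, and pair count are illustrative assumptions; a real data set would use the paper's Burgers, Darcy, or Navier-Stokes solvers.

```python
import numpy as np

def solve_darcy_1d(a, f=1.0):
    """Finite-difference solve of -(a(x) u'(x))' = f on (0, 1) with
    u(0) = u(1) = 0; a is the coefficient sampled on a uniform grid."""
    n = a.size
    h = 1.0 / (n - 1)
    a_mid = 0.5 * (a[:-1] + a[1:])            # coefficient at cell midpoints
    A = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        A[i, i] = a_mid[i] + a_mid[i + 1]
        if i > 0:
            A[i, i - 1] = -a_mid[i]
        if i < n - 3:
            A[i, i + 1] = -a_mid[i + 1]
    u = np.zeros(n)
    u[1:-1] = np.linalg.solve(A / h**2, np.full(n - 2, f))
    return u

def random_coefficient(n, rng):
    """A smooth, positive random field: a few random sine modes, exponentiated."""
    x = np.linspace(0, 1, n)
    field = sum(rng.normal() * np.sin((k + 1) * np.pi * x) / (k + 1)
                for k in range(4))
    return np.exp(field)

# Generate (a(x), u(x)) training pairs; the paper uses N = 10,000 of these
rng = np.random.default_rng(0)
n_grid, n_pairs = 65, 100
pairs = []
for _ in range(n_pairs):
    a = random_coefficient(n_grid, rng)
    pairs.append((a, solve_darcy_1d(a)))
```

Each pair costs one linear solve here; for stiff or time-dependent PDEs that cost grows quickly, which is exactly why generating training data is the expensive step.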

Statement of the universal approximation theorem for operators

Suppose that σ is a continuous nonpolynomial function, X is a Banach space, K_1 ⊂ X and K_2 ⊂ ℝ^d are two compact sets, V is a compact set in C(K_1), and G is a nonlinear continuous operator that maps V into C(K_2). Then for any ε > 0, there are positive integers n, p, m, and constants c_i^k, ξ_{i,j}^k, θ_i^k, ζ_k ∈ ℝ, ω_k ∈ ℝ^d, x_j ∈ K_1, i = 1, …, n, k = 1, …, p, j = 1, …, m, such that:

| G(u)(y) − Σ_{k=1}^{p} Σ_{i=1}^{n} c_i^k σ( Σ_{j=1}^{m} ξ_{i,j}^k u(x_j) + θ_i^k ) σ( ω_k · y + ζ_k ) | < ε

holds for all u ∈ V and y ∈ K_2. Note that this approximation theorem only uses one hidden layer in the neural network but does not specify how many nodes this layer has. In applications, just like in the finite-dimensional case, we use more than one layer.

Do not be intimidated by the big words and the Greek letters. What this theorem tells us is that we have theoretical grounds to formulate a neural network operator, and expect it to approximate the PDE solution operator very well. Even though we may never know the exact formula for the PDE solution operator, the operator neural network that we construct acts as a very good proxy. This is the reason we are all in love with approximation theorems, and should be eternally grateful to the mathematicians who find them.
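To make the theorem's formula concrete, the sketch below evaluates the double sum with random (untrained) constants: a "branch" factor built from the samples u(x_1), …, u(x_m) and a "trunk" factor built from y, multiplied and summed exactly as in the statement. The sizes n, p, m and the tanh choice for σ are illustrative assumptions.

```python
import numpy as np

sigma = np.tanh  # a continuous, nonpolynomial activation

def operator_net(u_sensors, y, c, xi, theta, w, zeta):
    """Evaluate the theorem's double sum:
    sum_k sum_i c[i,k] * sigma(sum_j xi[i,j,k] * u(x_j) + theta[i,k])
                       * sigma(w[k] . y + zeta[k]).
    u_sensors holds u sampled at m fixed points x_1, ..., x_m; y lies in K2."""
    branch = sigma(np.einsum("ijk,j->ik", xi, u_sensors) + theta)  # (n, p)
    trunk = sigma(w @ y + zeta)                                    # (p,)
    return float(np.sum(c * branch * trunk))

# Random (untrained) constants, only to show the structure of the formula
rng = np.random.default_rng(0)
n, p, m, d = 5, 4, 16, 1
c = rng.normal(size=(n, p))
xi = rng.normal(size=(n, m, p))
theta = rng.normal(size=(n, p))
w = rng.normal(size=(p, d))
zeta = rng.normal(size=p)

x_sensors = np.linspace(0, 1, m)
u = np.sin(np.pi * x_sensors)            # the input function u, sampled
G_u_y = operator_net(u, np.array([0.3]), c, xi, theta, w, zeta)
```

Training such a network means fitting the constants so that the sum matches G(u)(y) on sampled (u, y) pairs; the theorem only guarantees that good constants exist.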

We have mentioned this in the context of the fixed point iteration’s additive approximation of a solution, and it is worth mentioning again: whether approximating a mapping between finite dimensional spaces or a mapping between infinite dimensional spaces, neural networks represent functions, functionals, or operators using compositions of simple functions (a linear combination or linear integral operator composed with a nonlinear activation function) to approximate complicated ones. This is different than classical approximation approaches, where the approximation is additive and not compositional.

How do we branch out and dive into the more technical details?

For a deeper dive into neural operator networks, the three important publications on this topic are:

Numerical Solutions of High-Dimensional Differential Equations

Differential equations are universal; they can model almost anything we can think of, including our daily commute and traffic. It is hard to be exposed to differential equations, then not think of each situation that we find ourselves in as fitting into some sort of differential equation. That said, the curse of dimensionality has haunted this field since its inception, and has stood in the way of many practical applications. This is why many introductory PDE courses misleadingly focus only on one- and two-dimensional differential equations, as if that is all there is. If AI were to be given a different name that is not as flashy and definitely would not make it into any movies, it would be: processing, computation, and analysis of high-dimensional data. This does not talk AI down, because processing, computation, and analysis of high-dimensional data are exactly what humans do on a daily basis, provided we add a creative dimension to it (which in AI would translate to generative models). It is then not surprising that deep learning turns out to be an appropriate setting for finding numerical solutions of very high-dimensional differential equations. This is the focus of the article “Solving High-Dimensional Partial Differential Equations Using Deep Learning” (Han et al. 2018), which addresses PDEs with hundreds or even thousands of dimensions. This way, we can include all participating agents, assets, resources, or particles at the same time, instead of artificially devising handmade assumptions about their interactions and connections. The authors consider multiple high-dimensional PDEs, including the Hamilton-Jacobi-Bellman equation (what is the optimal strategy for each interacting agent, among hundreds of agents?) and the Black-Scholes equation (what is the fair price of a European claim based on one hundred underlying assets, given that no default has occurred yet?).

When we rely on a deep learning setting as a basis for our models, for example, for computing solutions of high-dimensional PDEs, the first question we must ask is what the input and the output of our deep learning network are. For any PDE whose solution is u(x,t), ideally we would input x and t and output u(x,t). Here x can be extremely high-dimensional. If the entries of x have any inherent stochasticity to them, such as the prices of financial market assets, then we must model them as such; if we don’t, we are usually assuming some sort of averaging. The bottom line is that for many realistic cases we input x as X, a stochastic process, which we must define mathematically.

One big step in the aforementioned article is reformulating the high-dimensional PDEs as backward stochastic differential equations before inputting X into a neural network that approximates the gradient of the solution. To master the essential math required here, we must define:

  • Brownian motion (see Chapter 11)

  • Stochastic process (see Chapter 11)

  • Stochastic differential equation (this is beyond the scope of the book)

  • Relating nonlinear parabolic PDEs to stochastic PDEs (this is beyond the scope of the book)

  • Backward stochastic differential equation (this is beyond the scope of the book)

And we must answer the question: why did we have to reformulate the PDE into a stochastic form before training the network? What advantage does this form give us? This goes beyond the scope of the book, but you now know what to look for and what questions to ask.

Finally, the method opens the door to solving many high-dimensional differential equations, but there are limitations. The method cannot be applied to the quantum many-body problem due to the difficulty of dealing with the Pauli exclusion principle.

Simulating Natural Phenomena Directly from Data

We have addressed particle systems once in this chapter. We used a statistical mechanics framework to describe the probabilities of the states of the system at the particle scale, then we used those to write down PDEs modeling the time evolution of the system at the macro scale.

In this section, we explain how recent neural network–based models simulate a particle system and predict its evolution without writing any PDEs. In other words, we bypass PDEs and trade them for learning from data.

To track the evolution of a certain particle system (such as water or sand) at a granular scale, we need to know each particle’s position vector p_i(t) at each time step t. How these positions change depends on local and long-range interactions between a particle and its neighbors (such as exchanging energy and momentum), which are dictated both by the physical nature of the system and by external effects such as gravity, temperature, forces, and magnetic fields. Instead of writing down explicit equations for these interactions and relating them to the particles’ positions, velocities, and/or accelerations, we can train a neural network to learn a map between a given state of a particle system at a certain time (input) and the positions (or velocities or accelerations) of all of its particles at a future time (output). Graph networks are well suited to model particle systems, since each particle along with its state can be a node, and the edges along with their features can model the interactions between specific particles.

We highlight and comment on the general ideas from a recent work that learns such a map: “Learning to Simulate Complex Physics with Graph Networks” (Sanchez-Gonzalez et al. 2020).

First, we need training data

We can generate input (particle system and its features at a certain time) and target (each particle’s acceleration at a later time) pairs from a data set of observed or simulated trajectories of a certain particle system. For example, from a 1,000-step-long trajectory, the team generates 995 pairs, conditioning on the 5 past states. In the data sets, we only need the position vectors, and we can derive velocity and acceleration vectors using finite differences. The data sets typically contain 1,000 train, 100 validation, and 100 test trajectories, each simulated for 300–2,000 time steps, tailored to the average duration for the various materials to come to a stable equilibrium.
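A minimal sketch of this pair-building step, assuming positions are stored as a (T, N, D) array: accelerations come from central finite differences of the positions, and each training pair couples a window of past states with the next acceleration. The exact off-by-one bookkeeping that yields the paper's 995 pairs is a convention; this version yields 994.

```python
import numpy as np

def make_pairs(positions, dt, window=5):
    """Build (input, target) training pairs from one trajectory.
    positions: (T, N, D) array of N particles in D dimensions over T steps.
    Input: the `window` most recent states; target: each particle's
    acceleration, from central finite differences of the positions."""
    # acceleration at step t: (p[t+1] - 2 p[t] + p[t-1]) / dt^2
    acc = (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2
    pairs = []
    for t in range(window, positions.shape[0] - 1):
        past = positions[t - window:t]    # (window, N, D) past states
        target = acc[t - 1]               # acceleration at step t
        pairs.append((past, target))
    return pairs

# A 1,000-step toy trajectory of 3 particles in 2D (a scaled random walk)
rng = np.random.default_rng(0)
traj = 0.01 * np.cumsum(rng.normal(size=(1000, 3, 2)), axis=0)
pairs = make_pairs(traj, dt=0.01)
print(len(pairs))  # 994 (the paper's 995 reflects a different indexing convention)
```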

Next, we need to build the map from input to output (the network components)

Start with the system at a certain state at integer time t: X^t = (x_0^t, …, x_N^t), where each of the N particles’ x_i^t represents its state at time t (which includes its position p_i^t and other characteristics such as mass, material properties, etc.).

Next, learn a map that represents the state X^t = (x_0^t, …, x_N^t) as a graph G = (nodes, edges, and global properties that can alternatively be included as node features). The node embeddings, node_i = function(x_i), are learned functions (using a multilayer perceptron) of the particles’ states. Directed edges are added to create paths between particle nodes that have some potential interaction. The edge embeddings, e_{i,j} = function(r_{i,j}), are learned functions (using a multilayer perceptron) of the pairwise properties r_{i,j} of the corresponding particles, for example, the displacement between their positions, a spring constant, etc.

Then learn a graph-to-graph map. This computes the interactions among the nodes via M steps of learned message passing to generate a sequence of updated latent graphs, G = (G^1, …, G^M), and then returns the final graph. Message passing allows information to propagate between the nodes via the edges, and the constraints to be respected. This way, the complex dynamics of the system are approximated by learned message passing among the nodes within their local neighborhoods. Moreover, the final graph has the same structure as the first graph, but with potentially different node, edge, and graph-level attributes.

Then, learn a map (a multilayer perceptron) from the final graph to a matrix that extracts the dynamics of the system, for example, the matrix of the particles’ accelerations Y = (p_1″, p_2″, …, p_N″). Finally, the particles’ positions and velocities are updated using an Euler integrator of the accelerations in Y. This in turn updates the system’s state to X^{t+1}.
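The final integration step can be sketched directly; here a hand-coded gravity vector stands in for the graph network's predicted acceleration matrix Y:

```python
import numpy as np

def euler_update(positions, velocities, accelerations, dt):
    """Semi-implicit Euler step: the predicted accelerations update the
    velocities, and the new velocities update the positions."""
    velocities = velocities + dt * accelerations
    positions = positions + dt * velocities
    return positions, velocities

# Stand-in for the learned network's output Y: one particle under gravity
p = np.array([[0.0, 1.0]])       # particle at height 1
v = np.zeros((1, 2))
Y = np.array([[0.0, -9.8]])      # "predicted" accelerations
p, v = euler_update(p, v, Y, dt=0.1)   # v is now [[0, -0.98]], p is [[0, 0.902]]
```

Because only the accelerations are learned, the same integrator rolls the system forward step by step to simulate a full trajectory.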

Such models are not restricted to materials and particle systems, but can model systems with many interacting agents, such as robotic control systems. They are a great step toward simulating complex phenomena authentically, which is of great value to science and engineering.

Hamilton-Jacobi-Bellman PDE for Dynamic Programming

The Hamilton-Jacobi-Bellman equation is yet another PDE whose solution unlocks many possibilities in economics, operations research, and finance, if only we are able to solve it in high dimensions. In a nutshell, we are searching for an optimal strategy (such as an investment strategy) that guarantees some minimal implementation cost over a given period of time. Ideally, we would like to include hundreds or thousands of interacting agents, such as all the financial assets for investment banking, instead of downsizing to unrealistic representative agent models. This is where using neural networks to find numerical solutions for high-dimensional PDEs helps us, as we saw earlier in this chapter.

Mathematically, the Hamilton-Jacobi-Bellman PDE is very rich. It combines dynamical systems (dx(t)/dt = f(x(t), a(t), t)), PDEs (partial derivatives and equalities), and optimization (max or min problems). When we learn how to derive this one PDE from real-world applications, attempt to understand it, find its solutions, and analyze these solutions (existence, uniqueness, smoothness, etc.), we acquire a ton of math.

Moreover, this PDE is directly related to reinforcement learning in AI, but instead of thinking about reinforcement probabilistically, in terms of Markov decision processes as in Chapter 11, we think about reinforcement in terms of deterministic dynamic programming.

In the dynamic programming setting, the states of the interacting agents, bundled in a vector x(t), evolve in time according to a dynamic system, and we need to find an optimizing policy that induces a special solution of this dynamic system: the one that incurs a minimal cost over a given period of time. The train of thought goes like this: a certain time-dependent policy affects the behavior of a dynamic system, which in turn affects the incurred cost. All of these are mathematical quantities.

The contributions of Richard Bellman (1920–1984) to the dynamic programming field (finding optimal strategies for an evolving system over a given period of time) are invaluable. We will encounter Bellman’s principle of optimality shortly, and in fact, it is Bellman who coined the term curse of dimensionality. This principle is tremendously helpful, as it breaks down the involved optimization problem over the considered period of time into smaller subproblems over smaller time intervals, which we can then solve in a recursive manner.

Bellman’s equation in deterministic and stochastic settings

In a deterministic dynamic programming setting, there are:

Discrete time Bellman’s equation

We can find the value function starting from the current time until the final time by picking the best strategy (or control or policy) a_k at the current time step k so that the current cost plus the value function at the next time step is minimized. This is a recursive process:

Value(x_k, n) = min_{a_k} ( Cost(x_k, a_k) + Value(x_{k+1}, n − 1) )

where n is the final time step, and the discrete time dynamics are:

x_{k+1} = f(x_k, a_k)

so that:

Value(x_k, n) = min_{a_k} ( Cost(x_k, a_k) + Value(f(x_k, a_k), n − 1) )

The sequence of optimizers a_k at each discrete time step k constitutes the optimal policy (or strategy or control) for the whole time period, and guarantees the minimal total cost, exactly as in reinforcement learning.
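The recursion above is easy to tabulate on a toy problem. The sketch below solves the discrete-time Bellman equation backward over a small state and action grid; the dynamics f and the stage cost are made-up illustrations, not from the text:

```python
def backward_dp(states, actions, f, cost, n_steps, terminal_cost):
    """Tabulate Value(x, n) = min_a [ Cost(x, a) + Value(f(x, a), n - 1) ]
    by recursing backward from the final time over a finite grid of
    states and actions (a toy deterministic dynamic program)."""
    value = {x: terminal_cost(x) for x in states}          # n = 0
    policy = {}
    for n in range(1, n_steps + 1):
        new_value = {}
        for x in states:
            best_cost, best_a = min(
                (cost(x, a) + value[f(x, a)], a) for a in actions)
            new_value[x] = best_cost
            policy[(x, n)] = best_a
        value = new_value
    return value, policy

# Toy problem: walk on {0, ..., 4}; pay |x| per step plus a small move cost
states = range(5)
actions = (-1, 0, 1)
f = lambda x, a: min(max(x + a, 0), 4)       # dynamics, clipped to the grid
cost = lambda x, a: abs(x) + 0.1 * abs(a)
value, policy = backward_dp(states, actions, f, cost,
                            n_steps=10, terminal_cost=lambda x: 0.0)
print(policy[(3, 10)])  # -1: with 10 steps to go, move toward state 0
```

The table grows with the number of states, which is exactly where the curse of dimensionality bites and where neural approximations of the value function come in.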

Continuous time Bellman’s equation

This is the Hamilton-Jacobi-Bellman PDE.

In a stochastic optimal control setting, there is also a stochastic version of Bellman’s equation. This is widely applicable in investment banking, and in scheduling and routing problems. In the stochastic framework, we need to find an optimal control input (strategy or policy) that guides the underlying stochastic processes to some desired final state, with minimal cost. Consider for example the problem where we need to execute a financial order, with a minimal implementation cost and within a certain period of time. We can first model the short-term dynamics of the underlying assets, then discretize both in time and state spaces. This allows us to execute a given amount of shares at each of the time steps, with the condition that we must execute all of the shares during the given time period. We search for the policy that tells us, among all possible actions that we can take at each point in time, what the optimal one is that gets us to where we want to be.

In Chapter 11, we connected Bellman’s equation to reinforcement learning. This was done in the context of a Markov decision process with value function:

Value(s) = max_{action} 𝔼[ reward_0 + γ Value(s′) ]

In a deterministic dynamic programming setting, the analogous equation is the Hamilton-Jacobi-Bellman PDE for the value function. Before writing its formula, this is the language that we need to pay attention to:

Minimizing a cost function

Is there a more common objective in this world?

Choosing an optimal control or optimal policy

This is the minimizer that we are looking for; it controls the dynamical system.

Value function

The total minimum cost over the considered period of time.

Bellman optimality principle

An amazingly helpful principle that allows us to simplify the optimization problem.

Backward in time solution

Start with the desired outcome and work our way backward, optimally to an initial state. It is intuitive to see why a backward-in-time solution is easier in this setting. Since we know the end goal, we immediately exclude all paths that don’t lead to it from the preceding time step, saving us the exploration of many useless paths. If, on the other hand, we solve forward in time, starting at the beginning of the time interval, then we do not have the advantage of closeness to the desired outcome, so we must waste our time and computational resources exploring many more useless paths.

The big picture

The ultimate question is: what must we do now (what are the initial state x(t_initial) and the time-dependent policy a(t)) to get us to where we want to be (x(t_final)), in the most cost-efficient way (attain the value function Value(x(t_initial), t_initial, t_final), which is the minimum value of the strategy implementation cost function)?

The involved quantities are:

  • x(t) is a vector characterizing the state of the dynamic system.

  • The strategy (or policy or control) a(t). We need to design this so that it invokes a state x(t) that minimizes some cost function. That is, if we input this special a(t) that we are looking for into the dynamic system, the output x(t) would minimize the cost function.

  • The cost function Cost(x(t), a(t), t_initial, t_final) incurred due to the implementation of the strategy (or policy or control). This is given by some terminal cost at t_final plus the sum of incremental costs (an integral) as we transition from t_initial to t_final. The incremental costs depend on the current state of the system and the current control.

  • The value function Value(x(t_initial), t_initial, t_final) is the minimal cost over a specific time period, attained by enforcing the minimizing policy a*(t), which in turn specifies the state x*(t) using the information about the dynamics of the system.

Hamilton-Jacobi-Bellman PDE

These are the involved equations and formulas:

dx(t)/dt = f(x(t), a(t), t)

Cost(x(t), a(t), t_initial, t_final) = Cost_final(x(t_final), t_final) + ∫_{t_initial}^{t_final} Cost_incremental(x(s), a(s)) ds

Value(x(t_initial), t_initial, t_final) = min_{a(t)} Cost(x(t), a(t), t_initial, t_final)

Bellman’s optimality principle tells us something very valuable about the behavior of the value function (the optimal cost) along the special trajectory x*(t) corresponding to the optimizing policy a*(t): the value over a specified time interval is the sum of the values obtained if we break the time interval apart along that trajectory. This enables us to break up the optimization problem over a longer time interval into a recursion of optimization problems over much shorter time intervals:

Value(x*(t_initial), t_initial, t_final) = Value(x*(t_initial), t_initial, t_intermediate) + Value(x*(t_intermediate), t_intermediate, t_final)

Using Bellman’s principle, we can derive the Hamilton-Jacobi-Bellman PDE that the value function satisfies. This PDE generalizes an older Hamilton-Jacobi PDE for optimal control. The solution of this PDE contains very valuable information. Suppose we encounter the system at any time t, not only at its initial state t_initial; then we can compute the value function up to the desired final cost by solving the Hamilton-Jacobi-Bellman equation:

−∂Value/∂t = min_{a(t)} ( ∇_x Value · f(x(t), a(t)) + Cost_incremental(x(t), a(t)) )

subject to the final time condition:

Value(x(t_final), t_final) = Cost_final(x(t_final), t_final)

This is a first-order PDE for the value function Value(x(t), t, t_final). Again, this is the optimal cost incurred from starting in state x(t) at time t and controlling the system optimally from then until time t_final. We know the final value function at t_final, and we are looking for the value function at time t, namely Value(x(t)). So we solve the PDE backward in time, starting at t_final and ending at t_initial.

Solving the Hamilton-Jacobi-Bellman PDE

If we are able to solve the Hamilton-Jacobi-Bellman PDE for the value function, then we know the optimal control a*(t), which in turn produces the least costly (or most rewarding) trajectory x*(t) from our current state x*(t_initial) to the final desired state x*(t_final).

The Hamilton-Jacobi-Bellman PDE does not, in general, have a smooth solution, so we must satisfy ourselves with weak or generalized solutions. This is a common theme for many PDEs, and a person studying PDE theory focuses almost exclusively on developing generalized solutions and understanding the function spaces that they live in (Sobolev spaces, etc.). Classic examples of generalized solutions for the Hamilton-Jacobi-Bellman PDE, which we only mention without elaboration, include viscosity solutions and minimax solutions.

What AI contributes to the vast literature on the Hamilton-Jacobi-Bellman equation is numerically solving it in very high dimensions, as in hundreds or thousands. The value function is a function of the state vector x(t) of the underlying assets or contributing agents, and if there are many of these, then the PDE is very high-dimensional. The paper we referenced earlier, “Solving High-Dimensional Partial Differential Equations Using Deep Learning”, addresses numerical solutions of the Hamilton-Jacobi-Bellman equation, in addition to other important and widely impactful high-dimensional PDEs.

The term Hopf formulas is usually associated with solutions of Hamilton-Jacobi PDEs. For a class of inviscid Hamilton-Jacobi–type PDEs, Darbon and Osher, in “Algorithms for Overcoming the Curse of Dimensionality for Certain Hamilton-Jacobi Equations Arising in Control Theory and Elsewhere” (2016), developed an effective algorithm for high-dimensional Hamilton-Jacobi PDEs, based on the Hopf formulas.

Dynamic programming and reinforcement learning

Using neural networks to learn optimal strategies for dynamic programming is called reinforcement learning in some circles and neuro-dynamic programming in others. The neural network, and the machine endowed with it, learns to anticipate how current and future actions affect a long-term cumulative cost or reward: the value of that time period. How do our current and daily investment strategies affect our annual performance? How do our first and subsequent chess moves affect the overall outcome of the game? The value function is the total of the costs and rewards corresponding to following the optimal strategy at each (discrete or continuous) time step.

The neural network needs inputs and outputs during training with historical data. The inputs are the state and all the potential actions that are allowed at that state, and the output is the value (the total costs and rewards). After training, for example, for a business model that is strategizing on how to address each customer, the neural network learns to take the customer’s state as input and outputs the next sequence of actions so as to maximize long-term value. Check “Neuro-Dynamic Programming” (Bertsekas et al. 1996) for an older but thorough explanation of neuro-dynamic programming and the use of artificial neural networks for approximating the value function in Bellman’s equation. This is great for reducing the effects of the curse of dimensionality: instead of storing and evaluating the whole high-dimensional value function, we only need to store the parameters of the neural network.
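A minimal sketch of that idea follows, with a tiny 3-parameter polynomial model standing in for the neural network: at each round we form Bellman-backup targets at sampled states and refit the parametric value function, so only a handful of parameters are stored instead of a table over the whole state space. The toy dynamics and costs are made up for illustration.

```python
import numpy as np

def fitted_value_iteration(f, cost, actions, rng, gamma=0.9,
                           n_rounds=50, n_samples=200):
    """Approximate dynamic programming: fit a small parametric model to
    Bellman-backup targets at sampled states, instead of tabulating the
    value function over the whole state space. A 3-parameter polynomial
    stands in for the neural network of neuro-dynamic programming."""
    features = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)
    theta = np.zeros(3)                   # Value(x) ≈ features(x) @ theta
    for _ in range(n_rounds):
        x = rng.uniform(-1, 1, size=n_samples)          # sampled states
        # Bellman backup: target(x) = min_a [ cost(x, a) + gamma * V(f(x, a)) ]
        targets = np.min([cost(x, a) + gamma * (features(f(x, a)) @ theta)
                          for a in actions], axis=0)
        theta, *_ = np.linalg.lstsq(features(x), targets, rcond=None)
    return lambda x: features(np.asarray(x, dtype=float)) @ theta

# Toy problem: x' = clip(x + 0.1 a), stage cost x^2 + 0.01 a^2
rng = np.random.default_rng(0)
f = lambda x, a: np.clip(x + 0.1 * a, -1, 1)
cost = lambda x, a: x**2 + 0.01 * a**2
V = fitted_value_iteration(f, cost, actions=(-1, 0, 1), rng=rng)
# Only theta's 3 numbers are stored, yet V can be evaluated at any state
```

Swapping the polynomial for a multilayer perceptron and the least-squares solve for gradient descent recovers the neuro-dynamic programming recipe.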

PDEs for AI?

The previous section highlighted the fact that dynamic programming and the Bellman equation are highly intertwined with AI’s reinforcement learning.

Moreover, the field of PDEs has an arsenal of analysis behind it, studying all kinds of functions, the spaces they live in, weak and strong solutions, and all kinds of convergence in all kinds of senses. If any field has the tools to unlock the secrets behind the success of neural networks in approximating many data-generating processes, whether joint probability distributions or deterministic functions, it would be the field of PDEs. We need to back neural networks with theorems and the mathematical rigor that eventually help with their design and architecture optimization. The seemingly magical abilities of neural networks need to be examined under the lens of analysis, and tools from the analysis of PDEs and their solutions are one promising way forward. Examples include Sobolev training (Czarnecki et al. 2017).

Other Considerations in Partial Differential Equations

For most of this chapter, we stayed away from the famous partial differential equations while highlighting its themes, to stress the fact that these themes apply to much more than the well-studied differential equations and applications. Undergraduate courses primarily address linear PDEs involving only functions of two variables (x,y) or (x,t). Students are left either misled, thinking that is all that matters, or wondering: what about nonlinear PDEs? And all the high-dimensional applications? Systems of PDEs? These courses also tend to focus on the heat equation (parabolic), the wave equation (hyperbolic), the Laplace equation (elliptic), and some numerical solutions and simulations (finite differences, finite elements, and Monte Carlo). These are presented in their simplest forms: linear, one-, two-, or three-dimensional, defined on domains with regular geometries, giving undergraduate students the false impression that these PDEs are the basis for all equations that might appear in applications. They also draw an artificial division between types of equations: elliptic, parabolic, and hyperbolic, as if there is a complete theory that encompasses each type. Analytical solution methods are narrow, focusing only on the principle of superposition of simple solutions (because of linearity), which leads to Fourier series and transforms (this is in fact a very good thing). Neural networks broaden the scope, approximating solutions of nonlinear equations using compositions of simple functions as opposed to additions.

The way undergraduate PDE courses are set up is wonderful, but they do not truly reflect the reality of PDEs: not theoretically, not numerically, not even their wide applicability. Students graduate feeling that, given a brand-new PDE, they have no idea what to do with it, because it doesn't fit anything that they learned in an introductory PDE class (I can tell you what to do with it first, of course after googling it: discretize it and simulate it; this will give you tremendous insight into the behavior of its solution).
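That advice, discretize and simulate, can be sketched in a few lines. The following toy example (the grid sizes are arbitrary choices for illustration) marches the 1D heat equation u_t = u_xx forward with explicit finite differences:

```python
# "Discretize and simulate it": explicit finite differences for the
# 1D heat equation u_t = u_xx on [0, 1] with zero boundary values.
n = 21
dx = 1.0 / (n - 1)
dt = 0.4 * dx * dx        # respects the stability condition dt <= dx^2 / 2
u = [0.0] * n
u[n // 2] = 1.0           # initial condition: a spike in the middle

for _ in range(200):      # march forward in time
    u = ([0.0]
         + [u[i] + dt / dx**2 * (u[i + 1] - 2 * u[i] + u[i - 1])
            for i in range(1, n - 1)]
         + [0.0])

print(round(max(u), 4))   # the spike has diffused and decayed
```

Even this crude simulation reveals the qualitative behavior of the solution: the initial spike spreads out symmetrically and decays, exactly the smoothing one expects from a parabolic equation.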

That said, there are general ways of thinking about putting PDEs together (modeling), which is almost always related to some conservation laws from physics; going about their analysis (theory: existence, uniqueness, and sensitivity analysis of solutions and weak solutions); and finding the actual solutions analytically or numerically (representation formulas, Green's functions, transform methods, and numerics).

For starters, each area of study has its own differential equations that model the phenomena that it cares for, for example:

  • In fluid dynamics, we study Navier-Stokes equations (among others). This is a nonlinear system of PDEs. Navier-Stokes PDEs take into account the velocity of the fluid, pressure, density, stresses, compressibility, and the forces acting on it. The equation expresses the conservation of mass and conservation of momentum. The solution of the equation describes the motion of the viscous fluid.

  • In economics and finance, we study the Black-Scholes equation (among others).

  • In population dynamics, we study the Lotka-Volterra predator-prey equations (among others).

  • In general relativity, we study the Einstein field equations (among others).

When PDEs model phenomena that evolve with time, there may be a deeper driver of the evolution, one that provides more insight into the PDE solution and its properties: a tendency to decrease an energy. Mathematically, the derivative of an energy functional is negative when evaluated at the solution of the PDE. We learn a lot of math understanding these energy functionals, their derivatives, and the function spaces they act on. We use a lot of relatively easy energy estimates to prove the existence of solutions to various nonlinear PDEs. The correct setting to study PDEs via energy methods is the Sobolev function spaces. The calculus of variations is concerned with the maxima or minima (collectively called extrema) of energy functionals. It is fundamental for the theory of nonlinear PDEs. The PDEs satisfied by minimizers of energy functionals are called Euler-Lagrange equations.
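This energy decay can be checked numerically on the heat equation, which is the gradient flow of the Dirichlet energy E[u] = ½ ∫ |u_x|² dx. The sketch below (grid parameters are arbitrary choices) verifies that a discrete version of E decreases at every time step:

```python
import math

# The heat equation u_t = u_xx decreases the Dirichlet energy
# E[u] = 1/2 * integral of |u_x|^2 dx. We track a discrete version of E
# along an explicit finite-difference solution.
n = 41
dx = 1.0 / (n - 1)
dt = 0.25 * dx * dx                                 # stable: dt <= dx^2 / 2
u = [math.sin(math.pi * i * dx) for i in range(n)]  # smooth initial data

def energy(v):
    """Discrete Dirichlet energy: 1/2 * sum of (v_x)^2 * dx."""
    return 0.5 * sum((v[i + 1] - v[i]) ** 2 / dx for i in range(n - 1))

energies = []
for _ in range(50):
    energies.append(energy(u))
    u = ([0.0]
         + [u[i] + dt / dx**2 * (u[i + 1] - 2 * u[i] + u[i - 1])
            for i in range(1, n - 1)]
         + [0.0])

print(round(energies[0], 4), round(energies[-1], 4))
```

The monotone decrease is the numerical counterpart of dE/dt ≤ 0 along solutions; the same check is a useful sanity test for any evolution PDE with a known Lyapunov functional.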

Summary and Looking Ahead

This chapter introduced us to PDEs as they relate to AI. PDEs have an unparalleled ability to model natural and social phenomena. Unlocking their solutions opens up many possibilities for many fields. We highlighted many of the difficulties in obtaining these solutions, such as the curse of dimensionality, mesh generation, and noisy data, and how AI helps address them.

There is much more work to be done. For developing physics-informed intelligent machines, we need to build new frameworks, data sets, standardized benchmarks, and new rigorous mathematics for scalable and robust systems.

There are many important PDE topics that we did not touch on, such as ill-posed inverse problems, where we need to learn the PDE’s parameters or initial data from partially or fully observing its solution. Physics-informed neural networks are effective and efficient for these kinds of problems.
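A toy version of such an inverse problem fits in a few lines: recover an unknown diffusion coefficient k in u_t = k·u_xx from snapshots of the solution. The sketch below uses a plain least-squares fit on finite-difference residuals (a hypothetical setup, far simpler than a physics-informed neural network, but built on the same idea of matching a model to observed solution data):

```python
import math

# Toy inverse problem: recover the diffusion coefficient k in u_t = k * u_xx
# from snapshots of the solution, via least squares on the
# finite-difference residual. All parameters here are invented.
k_true = 0.7
n = 41
dx = 1.0 / (n - 1)
dt = 0.25 * dx * dx / k_true        # stable explicit time step

u = [math.sin(math.pi * i * dx) for i in range(n)]
snapshots = [u[:]]
for _ in range(20):                 # generate the "observed" data
    u = ([0.0]
         + [u[i] + k_true * dt / dx**2 * (u[i + 1] - 2 * u[i] + u[i - 1])
            for i in range(1, n - 1)]
         + [0.0])
    snapshots.append(u[:])

# Since u_t ~ k * u_xx at interior points, the least-squares estimate is
# k = <u_t, u_xx> / <u_xx, u_xx> over all observed points.
num = den = 0.0
for a, b in zip(snapshots, snapshots[1:]):
    for i in range(1, n - 1):
        ut = (b[i] - a[i]) / dt
        uxx = (a[i + 1] - 2 * a[i] + a[i - 1]) / dx**2
        num += ut * uxx
        den += uxx * uxx

k_est = num / den
print(round(k_est, 4))  # recovers 0.7
```

With noisy or partial observations, the same fit turns into exactly the kind of ill-posed problem described above, which is where physics-informed approaches earn their keep.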

In the context of the Hamilton-Jacobi-Bellman equation, we mentioned viscosity solutions and Hopf-Lax formulas only casually. In the context of existence methods for PDEs, other than fixed-point iteration, there are minimax methods for certain PDE types. For example, we did not mention monotonicity methods or the maximum principle for elliptic and parabolic PDEs.

We leave this chapter with a question to ponder: “will PDEs advance us toward more intelligent agents?” and with an article where the authors make the case for physics-informed machine learning (Karniadakis et al. 2021), where we merge neural networks with physical laws to leverage the best of both worlds, and mitigate the lack of large data sets, or noisy data, in many scientific settings. The following is a quote from the article:

Such networks can be trained from additional information obtained by enforcing the physical laws (for example, at random points in the continuous space-time domain). Such physics-informed learning integrates (noisy) data and mathematical models, and implements them through neural networks or other kernel based regression networks. Moreover, it may be possible to design specialized network architectures that automatically satisfy some of the physical invariants for better accuracy, faster training and improved generalization.

Chapter 14. Artificial Intelligence, Ethics, Mathematics, Law, and Policy

Torture the data enough and it will confess to anything.

Nobel Laureate and economist Ronald Coase (1910–2013)

AI ethics is a wide and deep topic, and it is emerging as a new area at the intersection of the philosophy and AI fields. We can only scratch the surface in this chapter, highlighting some issues and possible ways to address them, but leaving many equally important ones out. Nevertheless, this chapter has a message that I don’t want you to miss:

We need more of us to be situated in both AI and policy.

In my learning journey from math to its applications in AI, I discovered that AI should not be disentangled from policy, and that the two should evolve together. I could sit and write about the million examples where there are ethical considerations associated with AI technology, such as data security, privacy, surveillance, democracy, freedom of expression, workforce considerations, equity, fairness, bias, discrimination, inclusivity, transparency, regulation, and weaponized AI, but this is not how I will approach this subject. My take on these issues is from a slightly different angle, where I have seen firsthand how people try new weapons on populations in war-torn areas, and yet the governments and the media deny, do not comment, or say the unfortunate events were mistakes that will be investigated, then all move on to better things. When there is a new technology that affects people at scale, the people developing the technology are the ones most qualified to know its ramifications, both good and bad. So they are the ones who should collaborate directly with policy makers to regulate its usage. Moreover, if there is a technology or an event that causes a massive disruption to society, we can thrust people into thinking about, writing, and complying with policy. The massive disruption is not AI per se, or the amount of data that humans currently produce and own, such as the data owned by Facebook, by NASA surveys of space, the Human Genome Project, or our Apple Watches; it is the money that is invested into this technology, and more importantly, the public attention.

I was living in a little and perfect math bubble, where things can only be black and white, logical and correct, and if we don’t understand how some math works, we can always convince ourselves that we can learn it if we just spend a bit more time on it. What opened my eyes was working with our city’s fire department and transportation department. When my students were presenting at city hall to city officials, public safety leaders, and policy makers, I realized that we, as technology specialists working with their data, had the power to tell them that our math models could do anything, whether these models did that or not. This realization was very scary for me. I am not a policy person by training, I am a math person, but I decided that I must go into policy. I inserted myself into small policy-making venues to build up some policy expertise (redoing hiring policy at my university; chairing the college council; chairing the academic policies committee; sitting on my university’s steering committee; running a data, policy, and diplomacy class; developing a summer program in Europe on human security, technology, and entrepreneurship in the face of modern warfare; and giving talks and workshops on the subject).

I have learned that policy is not like math. There are a lot of gray areas and conflicting interests, and treading its treacherous waters is a different game. I learned about the complexity of establishing new policies, and their intersections with existing policies. This is not unlike an AI system, where constant updates and consistency are of paramount importance, while at the same time staying efficient and not working ourselves and our systems into paralysis.

We must strive for concise and specific policies. Any technology with the potential to affect millions must be developed by its own experts with awareness and an attitude similar to that of emergency response teams, thinking of worst-case scenarios and guarding against them. The current state is that the world's leading technology companies are accelerating humanity toward a new, connected, and AI-powered world, while policy and regulation are playing catch-up. AI, however, is still maturing, so now is an ideal time to design policies that gear it toward the public good. Technological development is not some random thing that just happens to us. We should be more than passive participants, recipients, or consumers, especially since we ourselves are the data: our internet habits, social media posts, banking transactions, medical records, blood tests, MRI scans, grocery store runs, Uber rides, home thermostat preferences, video game skills, bus rides, Apple Watch step and heart rate counts, driving brake and acceleration patterns, our entire lives. These are digitized and stored in data warehouses in some random buildings in random locations. Unlike financial data that goes into our FICO credit score, which is heavily regulated, most of today's digital data is unregulated. One company can sell it to another, with all its inaccuracies, and the new company will build models and make decisions based on this unregulated data. Are someone's driving habits affecting whether they get into a certain college somewhere? Or determining the pricing of the premiums of their medical insurance? How about their daily commute that passes through a less affluent neighborhood? How about that minor offense that was cleared from someone's record a decade ago? Did it get cleared from all data sets, including those that were sold to other companies years ago? Is that still affecting life-changing and livelihood decisions such as loans, college acceptances, insurance premiums, and job offers? Who knows? It is unregulated. When we opt into sharing our data with one company, are there laws that prohibit sharing or reselling this data to other companies for other uses?

We can use our massive digital data for good, but we cannot bank on that without smart and effective policy and regulation.

Good AI

Good AI should be trustworthy enough to be deployed and used in the public and private sectors. There is a tendency in the field to spend a lot of time defining terms such as explainability, interpretability (apparently these two are different), fairness, equity, and many others. I see this hyperfocus on vocabulary as a distraction. The end goal is more important:

We need to trust our systems and make them accessible and understandable to those who need to use them.

For this, we need our AI, and the data that it is built on and serves, to be:

Secure

We have to keep maintaining and updating the physical and software security protocols as our systems evolve. Cloud computing has introduced a new layer of security requirements, since nowadays neither our data nor the computations happen anywhere in the vicinity of our local machines.

Private

Formal privacy notions and standards are already in place for many application sectors. There is a lot more to be done in terms of who owns the data and for what purpose it can be used by an AI system. My addition here is transparency and information sharing. When we are transparent about what our system intends to do with certain data, such as medical data to discover new drugs or create personalized treatment plans, people may opt in to share their data. Right now there is a culture of hesitation and mistrust between technology producers and technology consumers. We can amend this by spreading the knowledge and sharing the end goals and both successful and unsuccessful results.

Accomplishes what it is built for and what it claims to do

There are formal methods that can check whether code is correct or not, but we need more in terms of continuous testing of the system, including edge cases, and being transparent with the system’s capabilities, limitations, and untested territories.

Robust to perturbations and noise

Small perturbations to the input should not produce large changes in the output. When decision making relies on the predictions of an AI system, these predictions cannot be arbitrary. The AI system should be tolerant to noise in its inputs, and that tolerance must be quantified.
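One way to quantify that tolerance is to measure, empirically, how much the output can move per unit of input perturbation. The sketch below does this for a hypothetical linear scoring model (the weights and inputs are invented for illustration); for a linear model the ratio is bounded by the norm of the weight vector, which gives a concrete, reportable noise tolerance:

```python
import random

# Hypothetical linear scoring model: weights and inputs are invented,
# not taken from any real system.
WEIGHTS = [0.5, -1.2, 2.0]

def score(features):
    return sum(w * x for w, x in zip(WEIGHTS, features))

random.seed(0)
x = [1.0, 0.5, -0.3]                 # a nominal input
base = score(x)

ratios = []                          # output change per unit of input noise
for _ in range(1000):
    eps = [random.gauss(0.0, 0.01) for _ in x]
    perturbed = [a + e for a, e in zip(x, eps)]
    change = abs(score(perturbed) - base)
    size = sum(e * e for e in eps) ** 0.5
    ratios.append(change / size)

# For a linear model the ratio is bounded by the weight norm ||w|| ~ 2.385.
print(round(max(ratios), 3))
```

For nonlinear models the same sampling gives a local, empirical sensitivity estimate rather than a guaranteed bound, but it is still a useful number to attach to a deployed system.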

Efficient

Efficiency for AI systems should go without saying. They are founded on the promise of speed, automation, and their ability to manage large-scale computations, taking into account more contributing variables than was ever possible before. We need to continue to improve existing systems and attend to those that work in theory but are not yet efficient for real-world deployment.

Fair

Many systems rely on biased data that goes down the pipeline and then gets manifested with unfair decisions. Identifying biases in data and undoing them is a first step in the fairness direction.

Accessible and understandable to many users

When a new technology is beneficial to society, it needs to be made accessible and easy to use and understand. Intentional efforts should be made to industrialize it, commercialize it, and address access issues for disadvantaged sectors of society and communities.

Transparent

Transparency with data sources, model capabilities, use cases, limitations, and documentation is paramount. People usually have more tolerance for faulty systems when this information is continuously and clearly communicated.

Policy Matters

AI policy is starting to take shape. It is aimed toward harnessing and maximizing AI’s benefits while guarding against its potential harms.

Policy matters and makes a difference. One example is Clearview AI and its issues with privacy. Clearview AI is the US company that created, and sold to private companies, facial recognition software built on a database of billions of personal photos downloaded from the web. Recently (May 2022), it settled a lawsuit, agreeing to comply with the state of Illinois privacy laws that give people control over their biometric data. Clearview AI will restrict its facial identification technology primarily to law enforcement and other government agencies.

Another example is Hikvision and its issues with surveillance. Hikvision is a Chinese company that manufactures millions of video surveillance cameras used in more than 190 countries, for purposes ranging from police surveillance systems to baby monitors. The company is now facing sanctions from the US government due to its close ties with the Chinese government. Hikvision played a role in building China's massive police surveillance system, which the Chinese government used to oppress the Muslim minority groups in Xinjiang. The US Treasury is currently considering adding Hikvision to the Specially Designated Nationals and Blocked Persons List, which prohibits whoever is on this list from doing business with the US government, Americans, or US companies. Moreover, the assets of these entities or individuals are blocked by the US.

For organized efforts toward AI policy, one can look at governmental, intergovernmental, and global governance of AI initiatives (for trade, jobs, and geopolitical changes) that are taking shape in this direction: the United States’ National Artificial Intelligence Initiative, The EU’s Draft AI Ethics Guidelines, UAE’s Ministry of Artificial Intelligence, The Alan Turing Institute in the UK, Canada’s CIFAR AI Chairs Program, Denmark’s Technology Pact, Japan’s industrialization roadmap Society 5.0, France’s Health Data Hub, Germany’s Ethics Commission on Automated and Connected Driving, India’s #AIforAll strategy, China’s Global Governance of AI Plan, and others.

We can categorize AI-related policies into:

  • Investment into AI research and training the workforce
  • Standards and regulation
  • Building solid and secure infrastructures of digital data
Investment in development of skills and in the industrialization of technologies

Government agencies are allocating funding for AI research, new AI institutions, workforce training and early science, technology, engineering, and math (STEM) education, lifelong learning, and technology development. Governments are also encouraging the industrialization of AI technologies and private sector uptake. Moreover, governments are investing in data-driven initiatives and AI in their various departments, for public administration reform and to make their operations more efficient and centralized (AI in the government).

Regulations and standards

Regulations and standards include those for data security and usage, automotive AI such as self-driving cars, and weaponized AI.

Data and digital infrastructure

High quality data is central to the ability of AI to work as intended. Governments are encouraging open data sets and developing platforms for the secure exchange of private data. There are also intentional efforts to remove bias from AI algorithms and data sets.

What Could Go Wrong?

When designing a new system or analyzing an existing one, one of our guiding questions must be: what could go wrong? With this comes a list of checkpoints:

  • What is the system intended to do?

  • What data did it train on? How was the data collected? How were noise and missing values handled?

  • Who might be most underrepresented in the data?

  • What algorithms does it use?

  • What are the algorithms’ thresholds for decision making?

  • Given these thresholds, who can be harmed the most by these algorithmic decisions?

In this section, we sample a few examples (among many) that highlight the things that can go wrong and that we must either guard against or try to standardize and regulate.

From Math to Weapons

One goal of this book is to highlight the mathematical foundations of AI models. The transition from math to weapons is not new, given the development history of many weapons (e.g., the atomic bomb). The influence does not flow in only one direction, either: military and defense strategies and goals have shaped the development of entire math fields, such as dynamic programming, which initially addressed military scheduling for training and logistics, and optimizing the allocation of various resources.

The book Weapons of Math Destruction by Cathy O'Neil (Crown 2017) goes beyond military weaponization and lists, example after example, the many harmful effects of the mathematical algorithms that our society currently relies on for highly consequential and life-altering decisions. The first few paragraphs of the book's last chapter are worth quoting in full, since they reveal the intricate ways the algorithms deployed in seemingly different sectors interact with each other and influence outcomes. They also reveal how the exact same algorithms affect different populations in drastically different ways.

[…​] we’ve visited school and college, the courts and the workplace, even the voting booth. Along the way, we’ve witnessed the destruction caused by Weapons of Math Destruction. Promising efficiency and fairness, they distort higher education, drive up debt, spur mass incarceration, pummel the poor at nearly every juncture, and undermine democracy. It might seem like the logical response is to disarm these weapons, one by one. The problem is that they’re feeding on each other. Poor people are more likely to have bad credit and live in high-crime neighborhoods, surrounded by other poor people. Once the dark universe of Weapons of Math Destruction digests that data, it showers them with predatory ads for subprime loans or for-profit schools. It sends more police to arrest them, and when they’re convicted it sentences them to longer terms. This data feeds into other Weapons of Math Destruction, which score the same people as high risks or easy targets and proceed to block them from jobs, while jacking up their rates for mortgages, car loans, and every kind of insurance imaginable. This drives their credit rating down further, creating nothing less than a death spiral of modeling. Being poor in a world of Weapons of Math Destruction is getting more and more dangerous and expensive.

The same Weapons of Math Destruction that abuse the poor also place the comfortable classes of society in their own marketing silos. They jet them off to vacations in Aruba and wait-list them at Wharton. For many of them, it can feel as though the world is getting smarter and easier. Models highlight bargains on prosciutto and chianti, recommend a great movie on Amazon Prime, or lead them, turn by turn, to a café in what used to be a “sketchy” neighborhood. The quiet and personal nature of this targeting keeps society’s winners from seeing how the very same models are destroying lives, sometimes just a few blocks away.

Note that the math is correct and exactly the same for both sectors of society; what changed is the input to the model. Recall that if we wanted to summarize this whole book in one math sentence, it would be: the features of the input to an AI model determine the final output. Poor and rich populations, for lack of better terms, have different features, so they get different outcomes. Our algorithms are fair in this sense, computing exactly what they are supposed to compute. I am not a fan of presenting a problem without proposing solutions, or at least ideas for solutions. Maybe an initial way to improve the current situation is to train our algorithms separately using data from different groups of populations, so that a person’s poverty will not be a contributing factor in the algorithms’ decision about their trustworthiness to pay back a certain loan, but other real factors will be.

Chemical Warfare Agents

The destructive potential of AI models can manifest itself even with the models that are geared toward the utmost benefit to humanity: generative AI models for drug discovery. The ease with which bad actors can misuse the models is alarming. All a bad actor needs to do is to learn how the model works. First, the model maps the structure of a molecule to the way it acts in the body, then it optimizes for those molecules that maximize benefit and minimize toxicity. A bad actor can retrain the model, reversing its optimization objective from minimizing toxicity to maximizing toxicity. Mathematically, this is as simple as reversing the sign of the objective function in an optimization problem. This is the point that Fabio Urbina and his colleagues at Collaborations Pharmaceuticals recently highlighted about their work. To make this point, the team retrained their model with this malicious objective. In only 6 hours, the model generated 40,000 toxins, some of them actual chemical warfare agents that weren’t in the initial data set.
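The sign flip at the heart of this misuse can be sketched in a few lines. The candidate names and toxicity scores below are entirely hypothetical stand-ins for the outputs of a learned toxicity predictor; the point is only that the same selection routine avoids or seeks toxicity depending on nothing but the sign applied to the objective.

```python
# Hypothetical toxicity scores from a learned predictor (toy values).
toxicity = {"mol_a": 0.12, "mol_b": 0.87, "mol_c": 0.45}

def best(scores, sign=+1):
    # sign=+1 minimizes toxicity (drug discovery);
    # sign=-1 minimizes NEGATIVE toxicity, i.e., maximizes it (misuse).
    return min(scores, key=lambda m: sign * scores[m])

assert best(toxicity, +1) == "mol_a"   # safest candidate
assert best(toxicity, -1) == "mol_b"   # most toxic candidate
```

In a real generative pipeline the objective is a differentiable score inside an optimization loop rather than a lookup over a fixed list, but the reversal is exactly this one-character change.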

It is easy to conclude here that we need to be intentional, deliberate, introspective, and all kinds of adjectives about how to guard against this, without explicitly clarifying how, because the reality is that this is a complex issue. But how do we guard against this? My personal opinion is that we should approach this the same way we guard against weapons of mass destruction in the non-AI world. No one can guarantee that bad players will not get their hands on the technology, but our job is to make it very difficult for them to develop it into deployable weapons.

AI and Politics

The role of TikTok, Facebook, and other social media platforms in politics is hard to overstate. They have already affected election results and overturned governments. Bots can generate fake news, history, reviews, comments, pages, and tweets, and spread misinformation for political purposes. Social media companies are trying to battle this problem with multifaceted approaches: using machine learning to detect fraud and identify nodes spreading misinformation, employing third-party fact-checking organizations, and working on better ranking algorithms for users’ news feeds. The results are mixed, due to the scale at which these companies operate, and sometimes due to the conflict of interest between the companies’ profit objectives and their ethics departments.

Personalized political campaigns, in which the same politician caters to different ideologies depending on the targeted audience, without the audience ever knowing that this is the case, are a real danger that can undermine democracies. Moreover, based on new information about whether a certain state is swinging to the left or to the right, more funds can be allocated to target voters in highly competitive battlegrounds (again with personalized news feeds and political ads catering only to their preferred views, based on their historical preferences along with those of their friends) to swing their votes. This can happen in real time and affect the outcomes of entire elections. Such tactics have always existed in politics, but in the digital era they happen at scale, in real time, and with little more effort than the targeted deployment of algorithms backed by a giant database of our preferences and of what makes us tick, click, pay, volunteer, or elect.

Unintended Outcomes of Generative Models

Large generative language models and text-to-image models are trained on internet-scale data that inherits internet-scale social biases, discrimination, and harmful content. This is best illustrated with Imagen’s section on the limitations of its text-to-image model generating high-resolution images from text captions:

[…​] the data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped data sets. While this approach has enabled rapid algorithmic advances in recent years, data sets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M data set, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place. […​] Imagen may run into danger of dropping modes of the data distribution, which may further compound the social consequences of data set bias. Imagen exhibits serious limitations when generating images depicting people. Our human evaluations found Imagen obtains significantly higher preference rates when evaluated on images that do not portray people, indicating a degradation in image fidelity. Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias toward generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. 
We aim to make progress on several of these open challenges and limitations in future work.

How to Fix It?

Awareness of harmful, biased, unfair, intrusive, and weaponized AI has risen in the past few years, and efforts are ongoing to address these issues. The following are examples of such efforts.

Addressing Underrepresentation in Training Data

One theme that keeps repeating itself is the quality of the data that goes into training an AI model. Many biases appear because of the underrepresentation of nondominant groups, including their cultural values or languages, in large data sets. For AI to benefit everyone, one solution is to ensure that the data is labeled by its own people. For example, the Intelligent Voices of Wisdom AI project (which has now ended) led a data labeling workshop in 2021 where Native Americans relabeled images related to their culture. Many of these images had been wrongly labeled by machine learning classification models. They also created a knowledge graph of native culinary techniques, along with a chatbot to query the knowledge graph. Along with such efforts, AI can help preserve cultures, history, and languages that are about to go extinct.

Addressing Bias in Word Vectors

One first step in natural language processing is converting a language’s symbols, such as words, into vectors of numbers that carry the words’ semantics. In Chapter 7, we learned that language models construct these word vectors using a word’s context in the documents where it appears. So the meaning embedded in word vectors depends heavily on the type of corpus used to train the model. Corpuses are a product of the culture we live in. Many liberties and civil rights are relatively recent, and gender roles and sexual identities are no longer predetermined for us. Many corpuses used for training language models are based on internet news articles, Wikipedia pages, and other sources that are still biased and discriminatory and contain harmful stereotypes or content. We want to make sure that the word vectors that make it into our AI models do not reinforce discrimination and disproportionately harm women and minorities.

For example, if the training corpus (such as Google News articles) is mostly from a society where women are overrepresented as nurses or elementary school teachers and men are overrepresented as doctors or software engineers, then the word vectors would inherit this gender bias. The distance between the vector representing man and software engineer will be smaller than the distance between woman and software engineer. We need to identify and compensate for such biases in word vectors.

One solution is nice and simple. Given that we are dealing with vectors of numbers, we can literally subtract gender bias and other biases from these vectors. So the vector representing software engineer would be adjusted by subtracting the vectors representing man and male, and the vectors representing woman and female could be added, if we choose to bias in the other direction. Recall that when we add or subtract word vectors from each other, the new vectors obtained still carry meaning, since each entry in the vector represents some intensity in some meaning dimension. That is, if we subtract the vector for male from the vector for king, we would get a vector close to that of queen.
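This vector arithmetic can be sketched in a few lines. The 3-D vectors below are hypothetical toy values (real embeddings have hundreds of dimensions), and the sketch uses a common refinement of the subtraction idea: rather than subtracting the man vector outright, it projects out the gender direction (man minus woman), leaving the occupation vector equidistant from both gendered words.

```python
# Toy debiasing of word vectors by projecting out a gender direction.
# All vector values here are hypothetical, for illustration only.
import math

vec = {
    "man":      [ 1.0, 0.2, 0.1],
    "woman":    [-1.0, 0.2, 0.1],
    "engineer": [ 0.6, 0.8, 0.3],   # leans toward "man" in dimension 0
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Unit vector along the gender direction (man - woman).
diff = [m - w for m, w in zip(vec["man"], vec["woman"])]
gender = [d / norm(diff) for d in diff]

def debias(v):
    # Remove the component of v that lies along the gender direction.
    proj = dot(v, gender)
    return [x - proj * g for x, g in zip(v, gender)]

neutral = debias(vec["engineer"])
dist_man   = norm([a - b for a, b in zip(neutral, vec["man"])])
dist_woman = norm([a - b for a, b in zip(neutral, vec["woman"])])
assert abs(dist_man - dist_woman) < 1e-9   # now equidistant
```

In practice the gender direction is estimated from many word pairs (he/she, him/her, and so on) rather than a single pair, but the arithmetic is the same.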

Addressing Privacy

Privacy issues are at the forefront of the concerns about big data and AI. Machine learning models need data to train on, and this data contains personal and sensitive information about real people. Moreover, much of the computation on private data happens in the cloud, which raises even more security and privacy concerns.

If anonymizing data is infeasible, or if it lowers the performance of the model (for example, age, weight, race, and gender information are important for medical purposes), then encryption is our next option. For this, we need models that are able to perform computations directly on encrypted data. Traditional encryption schemes, however, do not allow any computations on encrypted data, so the solution is new encryption schemes that do. A secure device can then encrypt data and send it to a machine learning model running in the cloud; the model computes its predictions without ever decrypting the data and sends the encrypted results back to the device, which finally decrypts them locally, protecting all private data while still taking advantage of the cloud.

Homomorphic encryption does exactly that. The SIAM news article by Kristin Lauter (MetaAI), whose research is at the intersection of AI and cryptography, explains homomorphic encryption, and lists the following nice applications:

A cloud service that processes all workout, fitness, and location data in the cloud in encrypted form. The app displays summary statistics on a phone after locally decrypting the results of the analysis.

An encrypted weather prediction service that takes an encrypted ZIP code and returns encrypted information about the weather at the location in question, which is then decrypted and displayed on the phone. The cloud service never learns the user’s location or the specifics of the weather data that was returned.

A private medical diagnosis application: The patient uploads an encrypted version of a chest X-ray image to the cloud service. The medical condition is diagnosed by running image recognition algorithms on the encrypted image in the cloud; the diagnosis is returned in encrypted form to the doctor or patient.
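The encrypt-compute-decrypt loop behind all three applications can be illustrated with a toy additively homomorphic scheme. The sketch below is Paillier encryption with deliberately tiny keys, so it is NOT secure and is not what production systems use (modern homomorphic encryption is lattice-based and supports richer computations); it only demonstrates the core trick: the product of two ciphertexts decrypts to the sum of the plaintexts, so the "cloud" can add numbers it never sees.

```python
# Toy Paillier cryptosystem: additively homomorphic encryption.
# Tiny key sizes for illustration only -- NOT secure.
import math
import random

p, q = 2357, 2551                  # toy primes; real keys are ~1024-bit
n = p * q
n2 = n * n
g = n + 1                          # standard choice of generator
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)    # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# The "cloud" adds two encrypted values without ever decrypting them:
c1, c2 = encrypt(42), encrypt(58)
c_sum = (c1 * c2) % n2             # homomorphic addition of plaintexts
assert decrypt(c_sum) == 100
```

Note that the untrusted party only ever handles `c1`, `c2`, and `c_sum`; the secret values 42 and 58 exist in the clear only on the device that holds the private key.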

Learning about the efforts toward ensuring the security and privacy of our data in the age of the cloud and connected devices increases the public’s trust in the systems and their willingness to volunteer their data to enhance these technologies. That said, as anyone who has worked with real data knows, there is a lot to learn from being able to see the data we are working with. I am not sure how troubleshooting on encrypted data can work out.

Addressing Fairness

Humans recognize unfairness on an intuitive level. How do we make sure that AI models are operating fairly? One way is to monitor the models for which stakeholders they harm the most (such as older applicants for job openings, or minorities eligible for parole in the criminal justice system), then work on ways to fix that, such as de-biasing the training data, redefining the decision boundaries and thresholds, including humans in the loop, or reallocating resources to programs that lift disadvantaged groups.
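One simple form such monitoring can take is comparing a model's selection rates across groups. The sketch below uses hypothetical toy decisions and one common summary, the disparate-impact ratio (the "four-fifths rule" treats values below 0.8 as warranting a closer look); real audits use many metrics and much larger samples.

```python
# Toy fairness audit: compare approval rates across a sensitive group.
decisions = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]        # 1 = approved (toy)
groups    = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

def selection_rate(group):
    picked = [d for d, g in zip(decisions, groups) if g == group]
    return sum(picked) / len(picked)

rate_a, rate_b = selection_rate("A"), selection_rate("B")

# Disparate-impact ratio: min rate over max rate. Below the common
# 0.8 threshold, the model's decisions deserve closer scrutiny.
ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
assert rate_a == 0.6 and rate_b == 0.4
assert ratio < 0.8                                 # flagged for review
```

A check like this says nothing about *why* the gap exists; it only flags which groups a deployed model may be harming so that the fixes listed above can be investigated.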

Fair AI does not only have to do with decision-making algorithms. Fairness includes who benefits from the algorithms; for example, who gets informed about a job opening, vaccination availability, or education opportunities. The article “Adversarial Graph Embeddings for Fair Influence Maximization over Social Networks” (Khajehnejad et al. 2020) poses this as a fair influence maximization problem in social media graphs. For influence maximization graph models, there is usually a trade-off between selecting the nodes that have the most influence and selecting those that reach minority groups not necessarily strongly connected to the big hubs in the graph. Thus, the final set of influenced nodes is not usually fairly distributed with respect to race, gender, country of origin, and other attributes. Adversarial networks are usually a good fit for training models with competing objectives. The authors take advantage of this, introducing adversarial graph embeddings, in which two networks are trained together: an autoencoder for graph embedding and a discriminator to discern the sensitive attributes. This leads to embeddings that are similarly distributed across sensitive attributes. The authors then cluster the resulting graph embeddings to decide on a good initial seed set.

Injecting Morality into AI

An AI agent has to know the difference between right and wrong, and ideally be flexible enough to handle the gray areas of morality. We need a model that emulates humans’ moral judgments with all their situational variations and complexities. Ask Delphi attempts to do exactly that. When we ask Delphi, which is still a prototype, questions such as “Is it OK to rob a bank?” or “Is it OK not to talk to my husband?”, both our queries and Delphi’s answers are recorded, along with whether we agree with Delphi and our suggestions for improving Delphi’s response. As more people engage with Delphi, the training data is enhanced, allowing Delphi to learn more complex situations and make better predictions (moral judgments). The following excerpts and disclaimers are from Delphi’s website. They offer insight into the current state of the model:

Delphi is learning moral judgments from people who are carefully qualified on MTurk. Only the situations used in questions are harvested from Reddit, as it is a great source of ethically questionable situations. Delphi 1.0.4 demonstrates 97.9% accuracy on race-related and 99.3% on gender-related statements. After its initial launch, we enhanced Delphi 1.0.0’s guards against statements about racism and sexism, which used to show 91.2% and 97.3% accuracy.

Terms & Conditions (v1.0.4)

Delphi is a research prototype designed to investigate the promises and more importantly, the limitations of modeling people’s moral judgments on a variety of everyday situations. The goal of Delphi is to help AI systems be more ethically-informed and equity-aware. By taking a step in this direction, we hope to inspire our research community to tackle the research challenges in this space head-on to build ethical, reliable, and inclusive AI systems.

What are the limitations of Delphi? Large pretrained language models, such as GPT-3, are trained on mostly unfiltered internet data, and therefore are extremely quick to produce toxic, unethical, and harmful content, especially about minority groups. Delphi’s responses are automatically extrapolated from a survey of US crowd workers, which helps reduce this issue but may introduce its own biases. Thus, some responses from Delphi may contain inappropriate or offensive results. Please be mindful before sharing results.

Democratization and Accessibility of AI to Nonexperts

To maximize the benefits of AI technologies, they have to be democratized and made easily accessible to populations at large as opposed to being restricted to experts. For this to happen, and for people to trust these systems, the models and the data systems they rely upon must be understandable, easy to use, and transparent about their inner workings, capabilities, and limitations.

Anna Fariha, Ph.D. (Microsoft) is one researcher doing wonderful work toward this goal. She is interested in extending the capabilities of data systems to provide user-facing functionalities that help boost productivity and agility for a diverse group of users, ranging from end users to data scientists and developers.

Prioritizing High Quality Data

The examples in this chapter make the case for prioritizing, democratizing, and securing high quality data to obtain AI that is fair and beneficial to humanity. High quality data is clear, accurate, and impartial. It is stored in easy-to-query structures. The differences between data structures need to be explained to end users so they can decide which ones work best for them. For institutions that want to transition to data-driven decision making, get on the AI bandwagon, or stay competitive with younger companies where these technologies are built into their DNA, determining a plan to handle their data in an organized and consistent way is one step that is crucial for future success.

In our work with our city’s fire department and department of public transportation, we discovered many ways to improve the quality of their data. Implementing those at the very early stages of building data structures and collecting data would have saved an enormous amount of time, money, and resources down the pipeline. For example, with the bus routing project, data like the buses in operation and the number of drivers per month was not recorded, and neither was information about bus stops, such as which ones were marked and which were unmarked. Even when data was stored, it could be impossible to retrieve. Our university’s parking services informed us that to get historical data from the parking decks, they would have to make over 5,000 manual requests. All the data we obtained needed to be cleaned and transformed into a usable form. Sometimes, data obtained from the same source was inconsistent, and a lot of work could have been saved had more care been taken at the onset.

Something else happened with our data that taught us a lifelong lesson. Late into our project, after we had cleaned, joined, and transformed all the relevant data, and our models were producing results transferable to business decisions, such as identifying gaps between supply and demand in certain areas and highlighting the most significant contributors, we discovered that all the bus stop data that we had been given was scrambled. This meant that the ridership and route for each bus stop in the city did not correspond to the bus stop that was in the data table, and we had no way to fix it other than rerunning the original query against the database and tracking down what had gone wrong when the data file was written. Had we not discovered that, we would have based all our analysis on wrong data, garbage data! The transportation department would have acted on wrong results. We must always make sure that the data we work with corresponds accurately to what’s on the ground. We must plot, map, check, double-check, and triple-check. There is a responsibility that comes with our work, and we cannot take it lightly. We should know our data and our models inside and out. We should be prepared to answer all questions about our models, compare them to other models that are out there, and make sure we do our due diligence before giving our results to the stakeholders.
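An up-front check like the following (a hypothetical sketch with toy table names and values, not the city's actual schema) is the kind of cheap referential-integrity test that catches scrambled or orphaned records before any modeling begins.

```python
# Toy sanity check: every ridership record must join to a known stop.
stops = {101: "Main St", 102: "Oak Ave", 103: "Pine Rd"}   # toy data
ridership = [
    {"stop_id": 101, "riders": 240},
    {"stop_id": 103, "riders": 75},
    {"stop_id": 999, "riders": 60},   # orphan record: a red flag
]

# Records whose stop_id matches no known stop indicate a broken join
# or a scrambled export, and should halt the pipeline for inspection.
orphans = [r for r in ridership if r["stop_id"] not in stops]
assert [r["stop_id"] for r in orphans] == [999]
```

Running checks like this at every handoff between systems, not just once at the start, is what turns "double-check and triple-check" into routine practice.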

Like us, a general AI agent would look for the right data in the right places, then transform it into a usable form. Until then, we must refocus our efforts on collecting and storing good quality data and on having better ways to access and query it. Because of low-quality data and nonexistent digital infrastructures, many AI projects never see the light of day, and many automation investments never see any returns. We should step back and think about how data will end up being represented as inputs for our models. This is what should guide how we acquire data and how we store it for future use. The AI field has operated on a paradigm that should be adopted universally: representation first, acquisition second.

Distinguishing Bias from Discrimination

A lot of discussions that involve AI ethics use the terms bias and discrimination interchangeably, and I wanted to make sure that we highlight the difference between the two before we finish the book. I was never a person to be hung up on definitions of terms, especially since I speak English as a third language, and because I notice that redefining terms is often used as a cheap tactic to deflect from the main points of an argument or a debate. The reason I want to highlight the difference between bias and discrimination in particular is that each requires different mathematics to identify. Moreover, one is intentional, and the other is not. Both we and our machines should be able to reason about which one is which.

In a nutshell, we can detect bias merely by observing the data. We cannot identify discrimination unless we ascend from mere observations to a higher level of reasoning, using the causal language of interventions and counterfactuals, which we went over in Chapter 11: had I changed the gender of the applicant on their résumé, would they have gotten the job?

Bias is a pattern of association between a particular decision and a particular sex of applicant. We can detect this pattern directly when observing the data of applicants and eventual hires.

Discrimination, on the other hand, has intentionality in it: it is the exercise of decision influenced by the sex of the applicant when that is immaterial to the qualifications for entry. The gender of the applicant affected the hiring decision.

These definitions are highlighted in Judea Pearl’s The Book of Why. He goes on to mention the definition of discrimination in US case law, which also uses the language of counterfactuals:

In Carson v. Bethlehem Steel Corp. (1996), the Seventh Circuit Court wrote, “The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same.”

Therefore, to distinguish bias from intentional discrimination, we need to use the do-calculus on conditional probabilities, which we introduced in Chapters 9 and 10, and which you can learn more about from the excellent resources by Judea Pearl and his mathematical community.
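
To make the distinction concrete, here is a hedged sketch with invented hiring data, assuming (purely for illustration) that the department applied to is the only confounder between sex and the hiring decision. The observed conditional probability P(hired | sex) measures association, which is where bias shows up; the adjustment formula, the simplest fruit of the do-calculus, estimates the interventional P(hired | do(sex)):

```python
from collections import Counter

# Hypothetical observational records: (sex, dept, hired).
# Department confounds the picture: it affects both who applies and hire rates.
records = (
    [("F", "eng", 1)] * 10 + [("F", "eng", 0)] * 10        # F in eng: 50% hired
    + [("F", "sales", 1)] * 10 + [("F", "sales", 0)] * 70  # F in sales: 12.5% hired
    + [("M", "eng", 1)] * 40 + [("M", "eng", 0)] * 40      # M in eng: 50% hired
    + [("M", "sales", 1)] * 2 + [("M", "sales", 0)] * 18   # M in sales: 10% hired
)

def p_hire_given_sex(records, sex):
    """Observed association: P(hired = 1 | sex)."""
    rows = [r for r in records if r[0] == sex]
    return sum(r[2] for r in rows) / len(rows)

def p_hire_do_sex(records, sex):
    """Interventional P(hired = 1 | do(sex)) via the adjustment formula,
    assuming dept is the only confounder:
    sum over depts of P(hired | sex, dept) * P(dept)."""
    n = len(records)
    total = 0.0
    for dept in {r[1] for r in records}:
        p_dept = sum(r[1] == dept for r in records) / n
        stratum = [r for r in records if r[0] == sex and r[1] == dept]
        total += (sum(r[2] for r in stratum) / len(stratum)) * p_dept
    return total

print(p_hire_given_sex(records, "F"), p_hire_given_sex(records, "M"))
# 0.2 vs 0.42: a strong observed association (a bias pattern in the data)
print(p_hire_do_sex(records, "F"), p_hire_do_sex(records, "M"))
# 0.3125 vs 0.3: nearly equal once the department is adjusted for
```

In this fabricated example the stark observed gap is almost entirely explained by which department each group applies to, so the data shows bias without evidence that sex influenced the decisions; had the adjusted probabilities still differed, the counterfactual question about discrimination would be on the table.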

The Hype

The AI field has been accused of being hyped up throughout its history. Nowadays, any computational approach to solving problems or building systems, whether traditional or more recent, is being reframed as AI. Traditional statistics is AI, operations research is AI, data exploration and analysis is AI, quantum computing is AI, medical imaging is AI, etc. Many start-up companies are relying on inflated metrics, stretched truths, and investors who chip in without much question so as not to miss out on the next big thing (such as the busted Silicon Valley blood testing company Theranos). Since we are at an age where AI has become a buzzword and household term, it is easy to get swept away thinking that any technology based on AI is going to work.

Quantum computing is another technology that is still in its infancy, being hyped and conflated with AI. It is nowhere close to being commercialized, but is already being marketed as such. A lot of research remains to be done, and if it succeeds, the technology has a large potential for useful applications. The most famous application, which spurred considerable research funding and government attention, is Peter Shor’s 1994 theoretical demonstration that a quantum computer can solve the hard problem of finding the prime factors of large numbers exponentially faster than all known classical schemes. Rivest–Shamir–Adleman (RSA) encryption is an algorithm that modern computers use to encrypt and decrypt messages, and prime factorization is at the core of breaking its code.
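
To see why Shor’s result rattled cryptographers, here is a toy sketch, with absurdly small, hypothetical primes, of how factoring n immediately yields the RSA private key. At this scale plain trial division stands in for the quantum algorithm; real keys use primes hundreds of digits long, which is exactly what makes classical factoring infeasible:

```python
def toy_rsa_keys(p, q, e=17):
    """Build a toy RSA key pair from two (tiny) primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # private exponent: e * d = 1 (mod phi)
    return (n, e), d

def crack_by_factoring(n, e):
    """Recover the private exponent by factoring n.
    Trial division plays the role of Shor's algorithm at toy scale."""
    p = next(k for k in range(2, n) if n % k == 0)
    q = n // p
    return pow(e, -1, (p - 1) * (q - 1))

(public_n, public_e), private_d = toy_rsa_keys(61, 53)
message = 42
cipher = pow(message, public_e, public_n)   # encrypt with the public key
recovered_d = crack_by_factoring(public_n, public_e)
print(pow(cipher, recovered_d, public_n))   # attacker recovers 42
```

The entire attack reduces to the single line that factors n: whoever can do that fast, classically or quantumly, owns the private key.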

Specialized AI is well developed compared to quantum computing, and one of the goals of this book is to discern the hype from the nonhype. Hyped or not, get in the field, enjoy it, and work toward good goals and unlocking great potential.

Final Thoughts

Many sectors and industries are gravitating toward AI and data science. They want to leverage the substantial progress in computational power and the advancement of highly expressive models that transform data into meaningful insights and decisions. They also realize the potential for a sea change at the industry level, and they want to be part of it.

If you want to get into this beautiful and exciting field, you can go into the applied side of it. Choose an application in an industry that interests you and that you feel passionate about. Start by formulating questions that you want to answer, find data, and start applying what you have learned. Another path is to go into the research side, where we study the models themselves: how to improve them, scale them, analyze them, and prove theorems about their behavior, or come up with entirely new ones. Again, only choose research projects that you are genuinely curious about. One more path is to go into the coding side of things, building packages, libraries, and better implementations. You will be doing us all a favor that way. I cannot imagine what many of us would do if Keras and scikit-learn (Python libraries for machine learning and neural networks) did not exist.

Currently, there are only 22,000 Ph.D.-holding AI researchers in the world, 40% of whom are in the United States. To fulfill demand and bring new ideas into the field, we need many more researchers, both domestically and internationally. I hope this book was able to fast-track you into this fascinating field, and I hope you now have enough of a foundation to branch out on your own into any of the topics that interest you.

One of the most exciting things for me, as someone who has always appreciated math and its astounding ability to model our universe, is that AI has ignited people’s interest in math. I hope this in turn drives mathematicians to rethink how to present and teach math. Meanwhile, let’s all advocate for high-quality and accurately labeled data, for AI policy, and for honesty about what our systems do and do not account for. At the same time, we have to be very careful not to reduce the human experience to a stream of data and indicators, some measurable and others left to our fallible models to predict and base decisions on. As this book demonstrated again and again, experiences, click habits, zip codes, health records, comments on social media, images, tags, email correspondence, residence history, race, ethnicity, national origin, religion, marital status, age, our friends, our friends’ habits, etc., all find their way to becoming mere entries in a high-dimensional vector that gets fed to a machine learning model to make predictions. We want to make sure that we are not accidentally transforming ourselves into walking and talking high-dimensional data points.

Let’s leave with science fiction’s contribution to the ethics of AI: the “Three Laws of Robotics”, written by Isaac Asimov in his 1942 short story Runaround. The laws are:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

My final thought, for now: AI has tied many aspects of mathematics neatly together. Maybe this is not a coincidence. Maybe mathematics is the language that fits intelligence, and intelligence expresses itself most comfortably through mathematics. For intelligence to be artificially replicated, we need an agent that can represent the world, effortlessly, through its preferred language.

About the Author

Hala Nelson is an associate professor of mathematics at James Madison University. She has a Ph.D. in mathematics from the Courant Institute of Mathematical Sciences at New York University. Prior to James Madison University, she was a postdoctoral assistant professor at the University of Michigan, Ann Arbor.

She specializes in mathematical modeling and consults for emergency and infrastructure services in the public sector. She likes to translate complex ideas into simple and practical terms. To her, most mathematical concepts are painless and relatable, unless the person presenting them either does not understand them very well, or is trying to show off.

Other facts: Hala Nelson grew up in Lebanon during its brutal civil war. She lost her hair at a very young age in a missile explosion. This event, and many that followed, shaped her interests in human behavior, the nature of intelligence, and AI. Her dad taught her math, at home and in French, until she graduated high school. Her favorite quote from her dad about math is, “It is the one clean science.”

Colophon

The animal on the cover of Essential Math for AI is a harnessed bushbuck (Tragelaphus scriptus scriptus), an antelope found throughout sub-Saharan Africa. The animals live in many types of habitat, such as woodland, savanna, and rainforest. The harnessed bushbuck is named for a pattern of white stripes and spots along its back and flanks that resembles a saddle or harness. These white patches also appear on the animal’s neck, ears, and chin.

The harnessed bushbuck is the smallest of eight bushbuck subspecies, generally standing about 30 inches tall at the shoulder and weighing 70–100 pounds. Its coat is reddish-brown, though females tend to be lighter in color and have more conspicuous white markings. Male bushbucks also sport horns, which appear around the age of 10 months and eventually develop a single twist. Bushbucks graze on the leaves of trees and shrubs, as well as flowering plants—it is uncommon for them to eat grass.

The bushbuck is most active during the day and lives a solitary life within a defined territory. However, while they don’t gather in groups, neither are these animals overly aggressive. The male’s horns can be used in mating displays, to drive away competitors when a female is in heat, and for the rare territorial dispute, but adult bushbuck tend to avoid contact with each other. Female bushbucks bear one calf at a time, and hide the young one very carefully after birth, only visiting it to nurse. The mother also eats the calf’s dung so predators are not drawn to the area. After about four months, the calf begins to accompany its mother to graze and play.

Though bushbucks are affected by habitat loss and are hunted for their meat and hides, they are widespread and classified as Least Concern by the IUCN. Many of the animals on O’Reilly covers are endangered; all of them are important to the world.

The cover illustration is by Karen Montgomery, based on an antique line engraving from Shaw’s Zoology. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.